When I talk to people about DevOps, I hear a lot of opinions—mostly about automating everything and deploying a lot.
The problem is, if you deploy a lot but keep the same buggy practices in place, the team will be deploying buggy code all the time. The bugs will escalate faster than the new changes, and the fixes will introduce new bugs. Eventually, someone is going to declare defeat, and the team will put in some “process rigor” or “discipline” or something else. DevOps will have a bad reputation at the shop, and introducing the concepts will require a new name the next time it gets proposed.
It shouldn’t be like that, and it doesn’t need to be like that.
The main problem I see is that people begin with values. One classic value for programmers is that code is good. If you start with that value, then software-defined networks, software-defined infrastructure, and test tool automation are all inherently good because they are code. Add that value and a superficial understanding of DevOps, and you get the idea that DevOps is all about automating things.
DevOps does have a strong element of applying good programming concepts to other disciplines, like operations, rollout, creating new servers, and, yes, some amount of test tooling and infrastructure. The core idea, though, is about the various roles working together to create a delivery (and software) system that is more stable.
I’ve told people that. I’ve sent them articles to read about it. But the best way I have seen to explain DevOps is by either actually doing it or running a simulation. Given that doing without understanding will introduce the failures I mentioned earlier, a simulation might be the better way to go.
My friend Noah Sussman developed a game to do just that—the Abelian Sandpile Game.
Building Software in an Abelian Sandpile
For the curious, the Abelian sandpile is a mathematical simulation of a pile of sand. As you add sand to the pile, it grows taller until the pile cannot support the weight, and then the base expands. I think it’s an apt example to simulate what happens on software projects.
To play, you need a sheet of paper with a four-by-four grid printed on it, and some coins, checkers, or other small discs that can be stacked. Imagine the grid as your software model, and on that software you get to pick three deploy sites. Your job, as a team, is to deploy forty change requests on those deploy sites.
Of course, there is a catch. The software will have some bugs. When a square gets five changes (represented by the coins) on it, the software decays: one coin stays on the original square, and one coin moves to each of the squares above, below, to the left, and to the right. You can continue to deploy, and eventually you’ll have another decay. Decay enough, and your decays will decay. Once software decays off the grid, you’ve created a major outage.
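For readers who would rather poke at the rule in code than with coins, here is a minimal sketch of the decay step in Python. It is my reading of the rule as described above, not an official implementation of the game; the names new_grid and deploy are made up for illustration, and I am assuming that a square that somehow ends up with more than five coins keeps decaying until it is back under five.

```python
from collections import deque

SIZE = 4        # four-by-four grid, as in the game
THRESHOLD = 5   # a square decays once it holds five coins

def new_grid(size=SIZE):
    """Return an empty grid: a list of rows, each square holding zero coins."""
    return [[0] * size for _ in range(size)]

def deploy(grid, row, col):
    """Put one coin on (row, col), resolve every decay it triggers,
    and return how many coins fell off the grid (outages)."""
    size = len(grid)
    grid[row][col] += 1
    lost = 0
    pending = deque([(row, col)])
    while pending:
        r, c = pending.popleft()
        # Assumption: keep decaying while the square is at or over the limit.
        while grid[r][c] >= THRESHOLD:
            grid[r][c] -= 4  # four coins leave, one stays behind
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < size and 0 <= nc < size:
                    grid[nr][nc] += 1
                    if grid[nr][nc] >= THRESHOLD:
                        pending.append((nr, nc))  # a decay causing a decay
                else:
                    lost += 1  # the coin decayed off the grid
    return lost
```

With this sketch, calling deploy(grid, 0, 0) five times on a fresh grid returns 0 for the first four calls and 2 on the fifth, because two of a corner square's four neighbors are off the grid. The same five coins on an interior square such as (1, 1) lose nothing.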
That simulation isn’t too different from what we see in software all the time when we configure operating systems, upgrade the database or the web server, and ship all the code we are writing. Often we see a cascade effect where several small bugs combine to cause a larger problem.
A Lesson in DevOps
In the first round, players deploy software to the grid with a goal of minimizing the number of serious failures, which means having the fewest coins fall off the grid. The facilitator times the round without announcing it will be timed. In round two, the facilitator says the team has a mechanism to detect problems and roll back quickly, so players don’t need to worry about the coins that fall off. Instead, the team just needs to try to get all forty changes deployed.
The first round typically takes ten to twelve minutes to play. It will likely take longer if playing with a larger group that needs consensus, and it will be quicker with a single player. Round two might take four minutes, tops.
That’s a huge difference in throughput: forty changes in ten to twelve minutes works out to roughly three to four changes per minute, while forty changes in four minutes or less is at least ten per minute. It hints at the differences between traditional “catch all the problems before the release” testing and a more modern, DevOps-focused, “move fast and keep the system sustainable” approach. Teams that play the Abelian Sandpile Game experience the difference in performance for themselves.
The biggest problem I have with the real-life continuous delivery and DevOps approaches I’ve seen is they are missing that sustainability piece, referred to in engineering circles as resilience. If, instead of concentrating on increasing the time between failures, we focus on creating a resilient system (one where the cost of failure is minimized by reducing the time to recovery), we can get more done in less time, with less argument.
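One common way to put numbers on that trade-off is the standard steady-state availability formula, mean time between failures divided by the sum of mean time between failures and mean time to recovery. The sketch below is only an illustration: the function name and every figure in it are hypothetical, not measurements from the game or from any real team.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up,
    given mean time between failures and mean time to recovery."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical figures, chosen only to make the comparison visible.
baseline = availability(100.0, 4.0)         # fails every 100 h, 4 h to recover
fewer_failures = availability(200.0, 4.0)   # twice as long between failures
faster_recovery = availability(100.0, 0.5)  # same failure rate, 30 min recovery

for label, value in (
    ("baseline", baseline),
    ("fewer failures", fewer_failures),
    ("faster recovery", faster_recovery),
):
    print(f"{label}: {value:.4f}")
```

In this made-up case, recovering eight times faster does more for availability than failing half as often. How the two compare on a real system depends entirely on its own numbers.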
Please consider this an advanced beta version of the game. One thing I am not sure of is the ideal number of coins to play with: too few and the game is too easy, but too many and the game becomes something of a boring death march to inevitable defeat.
If you have tweaks to make the game more valuable, please leave a comment.