DevOps is evolving with some potentially very harmful choices embedded in it. Among these are poor adoption of sound computer science, little thought to the maintainability of DevOps code, and choices of tools based solely on productivity without concern for maintainability. Will this cause DevOps to fail to live up to its potential?
DevOps is a rapidly growing philosophy within the IT community, espousing close collaboration between software developers and system operations. DevOps holds that everyone involved in the design and construction of software should treat the production environment as their target. DevOps is how companies like Netflix and Amazon are able to maintain and enhance their software with few hiccups while serving tens of millions of people at a time. DevOps has also been touted as the new agile—as practiced by those who “do it right,” anyway.
The refrain “They did not do it right” is often used when Scrum has been implemented in an organization but the results are not what was expected. We now understand the concept of “being agile” versus “doing agile”: To “do agile” is to mindlessly follow the practices defined by an agile methodology such as Scrum—with poor results. In contrast, to “be agile” is to truly embrace agile's intentions and apply it in a way that leads to actual increased agility, quality, and value.
Is DevOps headed for a similar struggle to “do it right”? I worry that the direction in which it is moving will keep it from delivering on its potential. For example, among many DevOps engineers I hear constant complaints of unreliability and a continuous stream of problems with different causes. It is also clear that DevOps tools have ignored hard-won lessons of computer science. DevOps users often build homegrown frameworks of great complexity, ignoring the agile value of simplicity, a value that is also paramount to reliability. Those who choose the tools seem to make their choices based on personal productivity rather than lifecycle maintainability. Finally, DevOps teams have recreated the batch job, with all of its shortcomings.
Let’s examine these challenges in more detail.
A Pattern of Unreliability
Between 2001 and 2006 I specialized in consulting for clients who had been experiencing chronic problems with IT systems but had been unable to find a common cause to fix. The root problem was usually the same: systems built in a hurry by very smart individuals who became essential to the system's maintenance and were then reassigned to other efforts, leaving the systems to grow unreliable. Over time, the frequency of anomalies would increase.
These systems were not poorly designed, but they were complex—in one case, there was a Perl loop that was nested twelve levels deep!—and wholly without comments of intent within the code.
I see the same thing developing today in IT groups that are trying to practice DevOps: lots of homegrown frameworks of great complexity, with little documentation and lots of tribal knowledge needed to maintain them. This is a prescription for great headaches down the road as the gurus who created these frameworks move on.
Lessons of Computer Science Ignored
Most of today's DevOps tools rely on Ruby, a language designed and maintained by Yukihiro Matsumoto, known as “Matz.” Recently I had to build a multithreaded application and was shocked to discover that Ruby threading is, in practice, broken. Ruby has threads, but in the standard interpreter they are not truly concurrent: as of this writing, Ruby maintains a “global interpreter lock” that serializes the threads' execution, so they take turns like an old time-sharing system. Sun got Java threads working in short order; Ruby is now two decades old, and threads still don't work? To be fair, Java had the support of Sun, but that is rather the point: Java was nurtured into a robust enterprise language, with the resources needed to get there.
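The effect is easy to see in a few lines. Here is a minimal sketch (the helper `busy_sum` and the constant `N` are my own invention for illustration): two CPU-bound computations run in two threads, yet under the standard interpreter's global lock they take roughly as long as running them one after the other.

```ruby
require 'benchmark'

# CPU-bound work: sum the integers from 1 to n.
def busy_sum(n)
  total = 0
  1.upto(n) { |i| total += i }
  total
end

N = 5_000_000

# Run the work twice, one call after the other.
serial = Benchmark.realtime { 2.times { busy_sum(N) } }

# Run the same work in two threads.
results = []
threaded = Benchmark.realtime do
  threads = 2.times.map { Thread.new { busy_sum(N) } }
  results = threads.map(&:value)
end

# Under the standard interpreter's global lock, only one thread
# executes Ruby code at a time, so for CPU-bound work the threaded
# version gains essentially nothing over the serial one.
puts "serial:   #{serial.round(2)}s"
puts "threaded: #{threaded.round(2)}s"
```

On a true multicore runtime (Java, or an alternative Ruby implementation without the lock), the threaded version would take roughly half the serial time.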
The problem is that Ruby—the “new Perl”—is not well-suited to building systems that stay maintainable and reliable over time. Let me explain.
- Lack of Component Contracts
By the early 1980s the computing profession had realized that the freewheeling days of C programs and shell scripts had led to unmaintainable code, in part because those languages had no way to break a system up into pieces with well-defined behavioral boundaries. Software engineering pioneer David Parnas described the underlying principle as “information hiding”; today we call the use of such boundaries “encapsulation.” Most languages developed during the following two decades therefore had mechanisms for encapsulation, and many of these languages were referred to as “object-oriented.” For example, both Java and C++ allow the programmer to define “interfaces” (in C++, pure abstract classes) that specify a contract an object must adhere to.
This is relevant and important for DevOps, because DevOps code passes a great deal of data around. For example, the tools Vagrant and Chef are widely used together to script the provisioning of virtual machines. In a Vagrant script, one often sets JSON attributes that get read by Chef scripts, or “recipes.” If any attribute value is missing or set incorrectly, things blow up—at runtime. This is exactly the kind of failure that interfaces are intended to prevent. Chef recipes often define a kind of makeshift interface in the form of an attributes folder, but this is, frankly, so FORTRAN; it reminds me of COMMON blocks. And here we are in 2014.
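Here is a minimal simulation of that failure mode, using plain Ruby hashes rather than the real Chef node object, with a hypothetical attribute tree of my own devising. Nothing checks the contract up front; the missing attribute surfaces only when the recipe finally touches it.

```ruby
# Simulated node attributes, as a Vagrantfile might set them.
# 'db_host' was supposed to be set here, but was forgotten.
node = { 'myapp' => { 'port' => 8080 } }

# This lookup works as intended.
port = node['myapp']['port']
puts "port: #{port}"

begin
  # The missing attribute yields nil, and the chained call blows
  # up at runtime. No interface or compiler ever had a chance to
  # catch the omission.
  host = node['myapp']['db_host'].downcase
  puts "db host: #{host}"
rescue NoMethodError => e
  puts "runtime failure: #{e.message}"
end
```

With a declared interface, the absence of `db_host` would be an error at definition time, long before the provisioning run.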
- Lack of Closure
Ruby has very limited closure features. I am not referring here to closures in the sense of method objects, which Ruby does have; I mean the computer science and mathematical concept of closure, which is important for composability, a system design principle critical to system security and reliability. In this context, closure refers to the ability to create a boundary around something. Closure is extremely important for maintainability because it lets program components define their own little playground, so programmers need not worry that anything anywhere in the code base might be affecting the code they are looking at.
Languages such as Java, Ada, and C++ have very robust closure features. In Java, for example, you can import a package and reference things in it; Ruby can do this too. But in Java, if that package itself imports another package, you can't reference things in that other package unless you use a fully qualified name. Not so in Ruby: things can come in from all over the place, which makes tracking down the origin of a name a nightmare. This is not good for maintainability.
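A small self-contained demonstration of the Ruby side, using two throwaway files written to a temporary directory (the file names and the constant `INNER_SECRET` are invented for illustration): requiring one file silently exposes everything that file itself required, with no qualification at all.

```ruby
require 'tmpdir'

Dir.mktmpdir do |dir|
  # A file we never require directly.
  File.write(File.join(dir, 'inner.rb'), "INNER_SECRET = 42\n")

  # The file we do require; it pulls in inner.rb itself.
  File.write(File.join(dir, 'outer.rb'),
             "require_relative 'inner'\nOUTER = :ok\n")

  require File.join(dir, 'outer')

  # We only required 'outer', yet inner's constant is fully visible
  # as a bare name. Nothing in this file hints at where it came from.
  puts INNER_SECRET
end
```

In Java, the analogous transitive name would be reachable only through its fully qualified package path, which is precisely the breadcrumb a maintainer needs.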
I Want Candy
Management often feels that developers should be able to choose their own tools, but that does not mean there should be no oversight. Think about what a developer weighs when choosing a tool: “What makes me most productive?” Not “What will make the people who later have to maintain my code most productive?”
Today's DevOps teams in organizations around the world are picking tools for their organizations—foundational tools. And they are choosing based on what makes them most productive, and also what is popular today. They are not picking tools based on what will be maintainable.
The Batch Job Returns
I really thought the batch job was dead. Yet I am seeing it return in the form of the Jenkins job.
A developer tests something locally on his laptop, and it works. But that does not mean it will work in the cloud, so he submits a Jenkins job and watches while Jenkins waits to obtain a slave to execute it. The job finally starts, runs, and fails: it hits an error, say a misspelled method name, in a code path that never executed locally. (Ruby reports such errors only when the offending line actually runs—one of the joys of an interpreted language.) So the developer fixes the error, pushes the code to Git, and waits for Jenkins again. It’s a batch job!
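The interpreted-language point fits in a few lines. In this sketch (the `deploy` method and the misspelled `fetch_confg` are hypothetical), the file loads cleanly and the dry-run path works, so a local test passes; the typo detonates only when the other branch finally executes, perhaps on the CI server.

```ruby
# This file parses and loads without complaint, even though
# `fetch_confg` is a typo: Ruby resolves names only when the
# line actually runs.
def deploy(dry_run:)
  if dry_run
    puts 'dry run: nothing to do'
  else
    fetch_confg('prod')   # misspelled; a latent NameError
  end
end

deploy(dry_run: true)     # fine on the laptop...

begin
  deploy(dry_run: false)  # ...fails later, when the branch runs
rescue NameError => e
  puts "caught at runtime: #{e.name}"
end
```

A statically checked language would have rejected the misspelling at compile time, before any job was ever submitted.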
It is time to reassess where we are headed with DevOps. Are we creating things that will become unmaintainable legacy in short order?
Perhaps it is time to start demanding that maintainability be a strong factor to consider when deciding whether to develop homegrown frameworks and choosing which tools to use.
Perhaps it is time to start demanding that sound computer science principles be applied in the growing amounts of code our DevOps teams are creating. DevOps is not system administration; it is agile software development that requires system administration knowledge. Are the right people doing it? That is, people who have experience building maintainable systems?
Perhaps it is time to start demanding simplicity and focusing on how well application teams can learn the DevOps methods themselves, with the DevOps team acting as coaches and instructors instead of savants and gurus who create magic frameworks.
Perhaps it is time to not repeat the mistakes of the 1970s, but this time in the new mainframe of the cloud.
I would generally recommend not using multithreading in a cloud environment. Doing so is counter to the goal of making your app horizontally scalable. If you are doing multithreading, you scale up by adding CPUs, not by elastically adding VMs.
Good point, but I only used the threading example to illustrate that Ruby has major features that do not work as expected, which should make the reliability of the platform suspect. For infrastructure, things need to be rock solid.
Also, contrary to popular belief, multi-threading _is_ crucial for horizontal scaling. Popular frameworks such as Node use threads in the background; only the entry-point loop is single-threaded. This is a very old design pattern, originally used for device drivers.
I was totally following your first paragraph describing DevOps (closer collaboration, designing for production, etc.). But after that you stopped talking about DevOps and just described people making bad and sloppy decisions.
None of those bad decisions you described had anything to do with DevOps.
In fact, I would say that the intense focus on reducing lead time and improving quality that are the hallmarks of the DevOps movement would catch and call out all of those bad practices immediately (crap code, unreliability, overly complex tools, etc.). Much like many companies use Agile in name only, I think you've been seeing companies who are using DevOps in name only to justify bad behavior.
I would suggest you check out conferences like the DevOps Days global series (http://devopsdays.org), DevOps Enterprise (http://devopsenterprise.io), Velocity (http://velocityconf.com), or FlowCon (http://flowcon.org) to see a large number of companies leading the DevOps charge and working in almost the exact opposite way that you described.
The focus on Ruby's thread handling as a problem for DevOps is strange. The bottlenecks I see in most organisations attempting to increase speed of delivery and improve cycle time are typically at a much higher level than an OS thread: the constraints are usually at the level of cross-programme prioritisation at the organisational level (often with a highly over-worked 'Ops' team).
There are situations in 2014 where worrying about threading is valid (high-speed, low-latency stuff like financial markets and betting, for instance), but not where infrastructure automation and deployment are concerned: the challenges are at a higher level of abstraction.
As Damon says, there are loads of good examples of organisations taking an effective approach with DevOps. I'd like to add another suggestion: Build Quality In, a book of Continuous Delivery and DevOps experience reports: http://buildqualityin.com/ - success stories from around the world.
Just to clarify: the example of threading was only intended to illustrate the immaturity of Ruby after 20 years. One expects a language to be robust and all of its features to work. The waning reliability of tools as system components is a contributor to the overall unreliability of systems. Imagine that you have a system of 100 things, and each thing is 99% reliable: how reliable will the _system_ be?
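The arithmetic behind that question is quick to run, assuming all 100 components must work for the system to work and that failures are independent:

```ruby
# 100 components, each 99% reliable, all required at once.
system_reliability = 0.99 ** 100
puts format('system reliability: %.1f%%', system_reliability * 100)
# roughly 36.6%
```

Per-component reliability that sounds excellent compounds into a system that fails almost two times out of three.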
My concern is that, with infrastructure coding, things are getting really complicated fast, and not enough attention is being paid to maintainability and reliability. The use of dynamic languages is only one manifestation of this. It is not always feasible to have "unit testing" for DevOps functions, so static languages and tools amenable to static integrity checking (syntactic and semantic) would really help, but things are going in the other direction. The companies at the forefront of DevOps - the Amazons and what-not - have huge amounts of money to throw at it: if they need ten more folks to build and maintain custom tools, they just hire them, and they hire the best. But other organizations cannot use this strategy, and maintainability becomes a big problem, as do key-person dependencies.