Almost no one has built load testing correctly. Worse yet, our collective mental models seem to be damaged. I know these claims sound preposterous on the surface, but I hope by the time I’m done, I will convince you they’re true. As someone who has built my own load test systems, I was taken aback when I discovered how flawed my thinking was. However, before I demonstrate the damage, let me start with a quick walk-through of how I think our collective model works.
We start with an environment that handles some sort of concurrent usage—perhaps a website or an API. Depending on exactly what type of load testing is being done, this might vary a little, but in general we have a static set of users who use the system over and over again. Some load test systems call these “threads” or “virtual users.” Each user generates a request, waits for a response, and then makes more requests until the test script is completed.
To generate load, the system will run the test script over and over until some “done” condition is hit. The load test script may have some customization, such as the users varying their actions a little or pausing between requests. The load test may serve several different purposes, such as probing memory constraints or checking whether the system will remain stable over a long period of time. Once the test has been completed, you might examine data such as the average response time and the fiftieth percentile.
While I don’t think this model of load testing is broken for all possible load test goals, it certainly is for the vast majority of scenarios I have worked on. Have you found any issues yet?
Traffic in the Real World versus a Test
Stepping back from computers for a moment, imagine you are in a car, driving on the highway. You’re cruising at the legal speed limit, and there are about nine cars nearby. We might say there are ten users of this particular stretch of highway at this particular time. We could claim that in normal conditions, the highway can process ten cars per second on the given stretch of road. Then an accident happens. You go from a four-lane highway to a single lane. You and the other cars nearby can no longer move at ten cars per second. Traffic is making it difficult to navigate, so you slow down to half the legal speed limit. Assuming that it takes an hour for the wreck to be cleared away, what do you think will happen?
In the real world, you would see a traffic jam. The authorities might close down the highway to allow workers to clear the wreck. This is certainly going to make travel slow. Assuming that traffic remained constant, ten additional cars would appear at the traffic jam site every second, while fewer would leave.
If our load test model were parallel to the real world, it too would keep generating traffic in spite of the traffic jam. However, our load test only has a maximum of ten users it can test with. That means the system would never see the load it would in a real-world scenario. Your ten users would be going at half-speed, for sure, but the analogous additional cars would never appear. You would never see a traffic jam. You couldn’t; you only have a maximum of ten users. In a sense, your load test backs off from the system, letting the system recover more quickly. After all, there’s no need to shut down the highway if you only have ten cars waiting. Azul Systems cofounder and CTO Gil Tene coined the term coordinated omission to describe the problem.
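The difference between the two traffic models can be sketched as a toy simulation. Everything numeric here is an assumption made for illustration: the road passes ten cars per second normally and five per second while the wreck is being cleared, and the names `capacity` and `simulate` are hypothetical. The "open" model keeps sending ten new cars every second, as real traffic would. The "closed" model only ever has ten cars, as a ten-user load test would.

```python
# Toy model of the highway analogy. All numbers are illustrative assumptions:
# 10 cars/s in normal conditions, 5 cars/s during the one-hour accident.

def capacity(t):
    """Cars the road can pass during second t (accident from t=60 to t=3660)."""
    return 5 if 60 <= t < 3660 else 10


def simulate(seconds, arrivals_per_s, max_users=None):
    """Return the number of cars still waiting at the end of each second."""
    waiting = 0
    backlog = []
    for t in range(seconds):
        if max_users is None:
            # Open model: new cars arrive no matter how congested the road is.
            waiting += arrivals_per_s
        else:
            # Closed model: a fixed population, so a car can only come back
            # after it has left. The queue can never exceed max_users.
            waiting = min(waiting + arrivals_per_s, max_users)
        waiting -= min(waiting, capacity(t))
        backlog.append(waiting)
    return backlog


open_backlog = simulate(3700, arrivals_per_s=10)
closed_backlog = simulate(3700, arrivals_per_s=10, max_users=10)

print("worst backlog, real-world arrivals:", max(open_backlog))
print("worst backlog, ten-user load test: ", max(closed_backlog))
```

Under these assumed numbers, the open model ends the hour with 18,000 cars waiting, while the closed model never has more than 5. The load test never produces the jam that the real highway would.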
Lying to Ourselves with Data
Worse yet, the statistics you gather would be completely wrong. Imagine you have a load test running for a nice round one hundred seconds, and for each second you have a hundred transactions from a single user. That means you should see one transaction per ten milliseconds, or ten thousand transactions by the time the test is done.
That system is not only very fast, but our stats also would demonstrate how good it is. Now, instead of one hundred seconds, imagine a three-hundred-second load test, with the same parameters as last time. The first hundred seconds you would have ten thousand transactions, just like the previous example. Then, you unplug the network cable for a hundred seconds. Finally, the cable is plugged back in again and the test runs just like normal. For two hundred seconds, the test ran perfectly normally, but for the middle one hundred seconds you had no transactions, meaning your total was twenty thousand transactions rather than the thirty thousand a healthy system would have completed.
What do you think the median is in our load test results? Ten milliseconds.
How about the 90th percentile? Ten milliseconds.
Maybe the 99th percentile will show the trouble? Ten milliseconds.
Perhaps with a bit more precision—say, the 99.99th percentile? Also ten milliseconds.
In most load test systems, you’d never know there was a problem unless you looked at either the max time or the total number of transactions. We don’t even know how bad it would get because we don’t know how the system would look if you properly piled up transactions, like with the traffic jam. Your load test will look stellar, but the system would actually be behaving badly. You are coordinating with the system, allowing it to rest when it is at its worst.
This might be a real-world scenario for some systems, as people do abandon webpages when they don’t load quickly. However, that sounds a lot like saying it’s a good thing long lines convince customers to not buy our product.
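The three-hundred-second example can be checked with a few lines of arithmetic. This is a minimal sketch under two assumptions: the single request in flight when the cable was pulled is recorded with a latency of one hundred seconds, and percentiles use the simple nearest-rank definition. The helper `percentile` is hypothetical and is not taken from any load test tool.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


FAST_MS = 10          # normal response time from the example
STALL_MS = 100_000    # assumption: one request blocked for the whole outage

# 10,000 fast requests, the one stalled request, then 10,000 more fast ones.
latencies_ms = [FAST_MS] * 10_000 + [STALL_MS] + [FAST_MS] * 10_000

for p in (50, 90, 99, 99.99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
print("max:", max(latencies_ms), "ms")
print("count:", len(latencies_ms))
```

Every percentile up to the 99.99th reports ten milliseconds. Only the maximum and the sample count give any sign that a third of the test was an outage.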
How Can This Be Fixed?
Now that you see the problem with our existing model, what do we do about it? While the ideal solution would be to fix the tools, that is often not practically possible. I know Gatling and wrk2 have made efforts to fix the coordinated omission issue, but not all tools are open source or technologically flexible enough, or you may not have the expertise to change the tool. Replacing the tool with one that works better is possible, but that also has costs. There is another solution.
If you can’t fix the tool, you can try to fix the data. HdrHistogram can estimate roughly how bad things would have gotten if the test had continued to submit load at the typical rate. It does not add that additional load to the system, but it does include in the performance results the best case for what the system would have looked like under that load. That way you have some idea of just how badly things went. This gives you a more realistic lower bound on how slow the system would have been under that load.
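The idea behind this kind of correction can be sketched in plain Python. This is not HdrHistogram's API. It is a simplified illustration, modelled on my understanding of the back-filling the Java library performs in `recordValueWithExpectedInterval`. Whenever a recorded latency is longer than the expected interval between requests, the function adds the samples that the stalled user would have produced had requests kept arriving on schedule. The function names and the sample data are assumptions carried over from the cable-unplugging example.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile (a simple definition chosen for this sketch)."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def correct_for_coordinated_omission(samples, expected_interval):
    """Back-fill the requests that a stalled user never got to send.

    A request that took `value` blocked the requests due at every
    `expected_interval` after it; each of those would have waited
    progressively less, so they are added as value - interval,
    value - 2 * interval, and so on.
    """
    corrected = []
    for value in samples:
        corrected.append(value)
        missing = value - expected_interval
        while missing >= expected_interval:
            corrected.append(missing)
            missing -= expected_interval
    return corrected


# Assumed data: 10 ms responses, with one request stalled for 100 seconds.
raw_ms = [10] * 10_000 + [100_000] + [10] * 10_000
corrected_ms = correct_for_coordinated_omission(raw_ms, expected_interval=10)

print("samples:", len(raw_ms), "->", len(corrected_ms))
for p in (50, 75, 99):
    print(f"p{p}: raw {percentile(raw_ms, p)} ms, "
          f"corrected {percentile(corrected_ms, p)} ms")
```

With these inputs, the corrected data set has the thirty thousand samples a healthy run would have produced, and the upper percentiles now show waits of tens of seconds. This is still a best case, because the back-filled requests were never actually sent and so added no extra load to the system.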
No matter how you fix the problem, ultimately it is our responsibility to understand what our tools do as well as what the results mean. All too often we abdicate our responsibility to the tool, letting it become the expert and just believing the results it outputs. If we don’t bother to learn a little about how our tools work, understand the internals, and, in this case, do the math, we may give our stakeholders bad advice—and, ultimately, our users a much worse experience.