Performance tuning is often a frustrating process, especially when you remove one bottleneck after another and see little improvement. Danny Faught and Rex Black explain why this happens and how to avoid getting into that situation. They also discuss why you can't work on performance without also dealing with reliability and robustness.
In a recent session of our performance testing workshop, the students said they really appreciated learning about "hockey sticks" and "onions," so we'd like to share these concepts with you.
The Infernal Onion
When we run a performance testing scenario, we usually start with a light load and measure response times as the load increases. You would expect response time to increase with load, but you might not anticipate the dreaded "knee" in the performance curve: the point where response time stops climbing gradually and shoots upward. Figure 1 shows the hockey stick shape of the typical performance curve.
Figure 1: The classic hockey stick
The knee is caused by non-linear effects related to resource exhaustion. For example, if you exhaust all physical memory, the operating system will start swapping memory to disk, which is much slower than physical memory. Sometimes a subsystem like a Java interpreter or application server isn't configured to use all available memory, so memory limitations can bite you even when the machine has plenty of free memory. If your CPU horsepower gets oversubscribed, threads will start to thrash as the operating system switches between them to give each a fair share of timeslices. If too many threads are trying to access a disk, the disk cache may no longer give you the performance boost that it usually does. And if your network traffic approaches the maximum possible bandwidth, collisions may reduce how much of that bandwidth you can actually use.
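To see why the curve bends so sharply, it can help to model a single resource as a simple queue. The sketch below borrows the textbook M/M/1 queueing formula, R = S / (1 - U), where S is the service time per request and U is utilization (arrival rate times S). This is a standard approximation used here for illustration, not a model from the workshop, and the 10 ms service time is a made-up number. Even without swapping or thrashing, queuing alone produces the hockey stick; effects like swapping tend to raise the effective service time and bring the knee on sooner.

```python
# Illustrative only: models one resource as an M/M/1 queue (an assumption,
# not the authors' method). service_time is in seconds per request.
def response_time(arrival_rate: float, service_time: float) -> float:
    """Mean response time of an M/M/1 queue: R = S / (1 - U), with U = rate * S."""
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        # Demand meets or exceeds capacity: the queue grows without bound.
        return float("inf")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    service_time = 0.010  # hypothetical: 10 ms of work per request (capacity 100 req/s)
    for rate in (10, 30, 50, 70, 80, 90, 95, 99):
        ms = response_time(rate, service_time) * 1000
        print(f"{rate:3d} req/s -> {ms:7.1f} ms")
```

With these numbers, response time is about 11 ms at 10 requests per second and 20 ms at 50, but about 100 ms at 90 and roughly a full second at 99. Most of the growth happens in the last few percent of capacity, which is the knee.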
When we tune the performance of the system, we try to move that knee to the right so the system can handle as much load as possible before the response time shoots off the scale. This tuning often happens near the scheduled end of a project, when enough of the system works to allow system-level performance testing. When you improve the system's performance, what should you expect to happen next? Sometimes you're still limited by the same kind of bottleneck, though the knee has moved and overall performance is better. Often, though, you'll uncover a new bottleneck that is now the limiting factor. You may now be exhausting a different resource, or a different part of the system may be exhausting the same resource as before. Figure 2 shows a second bottleneck that was masked by the first one.
This is an application of "Rudy's Rutabaga Rule" from Jerry Weinberg's The Secrets of Consulting. The rule is "Once you eliminate your number one problem, number two gets a promotion." If you clear enough bottlenecks out of the way, you may reach your system's performance goals. But don't get frustrated if each change to the system improves performance by only a small amount. Figure 3 illustrates why.
If your system doesn't consume one resource much more heavily than the others, your bottlenecks will be stacked closely together, like the layers of an onion. Removing each layer will make only a small improvement; you'll most likely slam into another bottleneck waiting nearby.
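The onion effect can be sketched with a deliberately simple model: assume the system's maximum throughput is set by whichever resource has the lowest capacity, and that each tuning step doubles the capacity of the current bottleneck. The resource names, capacities, and doubling factor below are all hypothetical. Comparing a system whose capacities are widely spread against one whose bottlenecks are stacked closely together shows why each layer you peel off may buy only a few percent.

```python
# Hypothetical capacities in requests/second; the names and numbers are
# illustrative, not measurements from any real system.
def max_throughput(capacities: dict[str, float]) -> tuple[str, float]:
    """Return the limiting resource and its capacity; the system can go no faster."""
    name = min(capacities, key=capacities.get)
    return name, capacities[name]


def tune(capacities: dict[str, float], steps: int,
         factor: float = 2.0) -> list[tuple[str, float, float]]:
    """Repeatedly relieve the current bottleneck by `factor`.

    Returns (relieved resource, throughput before, throughput after) per step.
    The input dict is copied, so the caller's capacities are not modified.
    """
    caps = dict(capacities)
    history = []
    for _ in range(steps):
        name, before = max_throughput(caps)
        caps[name] *= factor  # "remove" this bottleneck by doubling its capacity
        _, after = max_throughput(caps)  # the next bottleneck gets a promotion
        history.append((name, before, after))
    return history


if __name__ == "__main__":
    spread = {"cpu": 100.0, "memory": 300.0, "disk": 900.0}
    stacked = {"cpu": 100.0, "memory": 105.0, "disk": 110.0, "network": 115.0}
    for label, caps in (("spread", spread), ("stacked", stacked)):
        print(label)
        for name, before, after in tune(caps, steps=3):
            gain = (after / before - 1) * 100
            print(f"  relieved {name}: {before:.0f} -> {after:.0f} req/s ({gain:.0f}% better)")
```

In the spread case, the first fix doubles throughput from 100 to 200 requests per second. In the stacked case, doubling CPU capacity raises throughput only from 100 to 105, because memory is waiting right behind it, and each later step gains about as little.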
It Won't Go Fast if It Doesn't Go at All
Testing the system's performance tells us how fast each user can complete a task, and how many users it can support. A related concept is reliability, where we look at how long the system can operate before encountering a failure. You might want to devise a reliability test that doesn't step up the load the way a performance test often does. Not all projects do reliability testing, though, so you might be conducting performance testing before the system's reliability is solid. In that case, you'll usually