Working for an electric utility company is different from most e-commerce operations. We don’t have sales or promotions that drive extreme volumes of traffic to our website. We have power outages.
Before the internet and cell phone technology, if the power went out, you lit a candle and waited for the power to be restored. Now, out come the cell phones and tablets; everyone wants to confirm the outage on the website. Perhaps it is a feeling of community to know just how many people are in the dark with you, or to make sure it is not just your house. But when an outage occurs, our web traffic explodes.
On a typical day, we have a small number of customers interacting with the website at any one time: paying bills, arranging for new services, or one of the many other things offered on our website. We rebuilt the website several years ago with the recognition that our customers wanted to do business with us on their schedule, not just when our doors were open.
We also learned that when the lights go out at your house, you want to know some basic things: How many users are impacted by the outage? What happened? When will the power come back on?
My role in this process was to find a cost-effective method to stress test our website with a high volume of customer traffic to ensure that when the power goes out, our website can handle the surge.
We own a load-testing tool that is great for testing typical loads, but it was not cost-effective to purchase 50,000 virtual users (10 percent of our customer base) or more, so we had to find another solution. We also wanted to test our entire infrastructure, including the ISP, for a surge in load.
Our requirements were pretty basic:
- Create a scenario where we can push thousands of test users against the target website
- Ensure the source of the traffic can be white-listed so security does not mistake the test for a DoS attack and shut it down (this is very important, as public websites are subject to DoS attacks on a regular basis)
- Be able to collect metrics on performance as the number of users increases, including webpage breakdowns, to see not just performance, but what elements are causing the performance degradation
- Be able to run the tests on our schedule, during off hours when normal traffic is low
- Keep it cost-effective
- Make it repeatable
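The metrics requirement above can be illustrated with a small sketch. It assumes the load tool can export per-element timings as rows of (concurrent users, page element, load time in milliseconds); that export format, the function name slowest_elements, and the sample element names in the usage note are all hypothetical and do not come from any particular vendor.

```python
from collections import defaultdict
from statistics import mean


def slowest_elements(records, top_n=3):
    """Rank page elements by average load time at each load level.

    records: iterable of (concurrent_users, element, load_ms) tuples.
             This shape is an assumption about the tool's export.
    Returns {concurrent_users: [(element, avg_ms), ...]}, slowest first.
    """
    # Group every timing sample by load level, then by page element.
    grouped = defaultdict(lambda: defaultdict(list))
    for users, element, load_ms in records:
        grouped[users][element].append(load_ms)

    report = {}
    for users in sorted(grouped):
        averages = [(el, mean(times)) for el, times in grouped[users].items()]
        # Slowest element first, so the likely culprit is at the top.
        averages.sort(key=lambda pair: pair[1], reverse=True)
        report[users] = averages[:top_n]
    return report
```

Called with made-up rows such as (1000, "outage-map.js", 200) and (5000, "outage-map.js", 1200), the report shows at a glance which element slows down as the user count climbs.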
Finding a Stress-Testing Tool
An internet search turned up many vendors offering viable solutions. Many of them provide a free or low-cost trial with a small number of users (usually about 5,000) for a short test duration (one to five hours).
There are many pricing options to consider, so you need to think about how many users you really need at one time and how often you are going to run tests. You can expect several runs as you get your scenario set up and the target system properly configured to handle the planned load. You should also consider periodic retesting to ensure that patches or other changes made to your infrastructure have not introduced an unanticipated performance issue.
After evaluating multiple vendors, we selected two for a “bake-off.” Each vendor provided a free trial for us, and we did the scripting, test execution, and results analysis. The goal was to determine how hard these tools were to use and whether they could provide the value we needed in a stress test.
We had each vendor supply a white list of the IP addresses the load would come from. Because a customer experiencing an outage lives in our service area, we selected load locations close to New Mexico (Texas, Colorado, and California) so the test would not introduce latency that our real customers don’t experience.
We worked with security to expose the QA instance of our website as the target for the load test during a selected weekend when real customer load volume was expected to be low. Prior to starting the load test, we checked to ensure no real outage or other issue was present that would make a performance-impacting load test an issue for our real customers.
As with any load test, you don’t go from zero to 50,000 users in one minute. After some trial and error, we found a workable ramp rate that allowed load to grow in a controlled fashion. Remember, we were testing at a load we had never attempted before, so some trial and error was to be expected.
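A controlled ramp of this kind can be sketched as a simple stepped schedule. This is only an illustration: the function name ramp_schedule, the linear shape, and the 60-minute ramp with 10-minute steps in the usage example are assumptions, not the ramp rate we actually settled on.

```python
def ramp_schedule(target_users, ramp_minutes, step_minutes):
    """Return (minute, users) steps for a linear ramp up to target_users.

    All three arguments are hypothetical planning inputs; tune them by
    trial and error against your own system.
    """
    if target_users <= 0 or ramp_minutes <= 0 or step_minutes <= 0:
        raise ValueError("all arguments must be positive")

    steps = -(-ramp_minutes // step_minutes)  # ceiling division
    schedule = []
    for i in range(1, steps + 1):
        # Clamp the last step so the ramp ends exactly at ramp_minutes.
        minute = min(i * step_minutes, ramp_minutes)
        users = round(target_users * minute / ramp_minutes)
        schedule.append((minute, users))
    return schedule


if __name__ == "__main__":
    # Example: reach 50,000 users over an assumed 60 minutes, 10 at a time.
    for minute, users in ramp_schedule(50_000, 60, 10):
        print(f"minute {minute:>3}: {users:>6} users")
```

Each step gives you a natural checkpoint to look at the servers before adding more load.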
During the test, we carefully monitored our servers for bottlenecks that could create performance issues, as well as the user experience in real time, to see how the test was going so we would know to stop the test if needed.
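One way to make the "stop the test if needed" decision less subjective is to agree on thresholds ahead of time. The sketch below is an assumption about how such a check might look, not the procedure we followed: the function names, the sample format of (response time in milliseconds, succeeded), and any threshold values you pass in are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 0-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_stop(samples, max_p95_ms, max_error_rate):
    """Decide whether the latest monitoring interval breaches a limit.

    samples: list of (response_ms, succeeded) pairs from one interval.
    max_p95_ms, max_error_rate: hypothetical thresholds agreed on
    before the test starts.
    """
    if not samples:
        return False  # nothing measured yet, so nothing to act on

    times = [ms for ms, _ in samples]
    errors = sum(1 for _, ok in samples if not ok)
    error_rate = errors / len(samples)

    # Stop if either the slow tail or the failure rate exceeds its limit.
    return percentile(times, 95) > max_p95_ms or error_rate > max_error_rate
```

A check like this would run against each monitoring interval while someone still watches the servers and the live user experience; it supplements human judgment rather than replacing it.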
What We Learned from the Test
The tests showed us several unexpected opportunities to better serve our customers during an outage. We also were able to verify that our ISP design could handle this kind of load as promised.
The bake-off allowed us to evaluate the different vendor solutions using our systems and our data and see which one was a good fit for us. In addition, we were able to demonstrate the value of this type of testing to others in the company who had not considered it a viable option. The ability to generate valuable analysis of the test results was the area of greatest variability between the tools, and I would encourage you to allow time to learn the analysis tools and ask lots of questions.
Just like with any test, your results are only of value if the test actually represents how your users will use the application. You need to recognize that a user’s behavior will change based on the scenario being simulated.
A stress test like this looks for the rare scenario where volume is many times higher than on a typical day. This scenario can happen in almost any business, and planning and executing this kind of test is part of a good business plan.