System load testing tools are well known for showing how your system will perform under different scenarios, but that is not all these tools can do. This article looks at how the capabilities built into these tools can yield critical information about your production systems.
Performance testing tools allow us to simulate load on targeted systems to better understand performance in various scenarios. These tools are designed to spot load issues, memory leaks, and many other things that can (and do) impact performance. They can also collect very detailed server-side metrics during the test, so the team sees not only the performance from the user’s perspective but also the server-side performance, which helps in understanding the factors behind the performance issues and in fixing them.
The classic use case for these tools is “in test,” but you can also use them on production—on live systems—to investigate performance issues when support tools may fall short.
Here are three examples from my career.
Issue #1: Sporadic performance issue for an application
If you talk with your support team about application performance, you may hear a recurring story such as, “Users are always saying the app is slow or frozen, but when we check it, we don’t see any issues.”
A few possible causes include:
- Users don’t understand how the application is supposed to work (unlikely, as these users spend all day on the application, so they should know it pretty well)
- There is some kind of intermittent issue at work causing the change in performance
- The users are doing something we did not anticipate in testing
We had this exact issue in an application used by a pool of more than a hundred users, and the support team felt there really was an issue in play but could not spot it. When they looked at the servers, the average load for an hour was fine. We decided to investigate possible causes using our performance testing tool.
We employed the following steps:
- Created a script that only did queries (This was production, so no change transactions were to be done)
- Installed the load simulator software on the machines of three users who were part of this team. These users were on different VLANs so we could also look at network performance as a possible cause. Each of these machines was used concurrently by the performance test and the actual user, to determine whether the issue was related to some user action.
- Deployed three load simulator PCs to the floor, one on each VLAN, to act as another data input
- Set up the scenario to run for ten hours per day, with ten transactions per hour. The load was low enough that actual production users were not impacted by the test execution.
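The pacing in the last step can be sketched in a few lines of Python. This is only an illustration of the schedule, not the script we ran in our performance tool: the lookup names, the start time, and the `probe_schedule` helper are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from itertools import cycle

# Hypothetical read-only lookup methods; a real script would use whatever
# queries the application actually supports.
LOOKUPS = ["phone", "account_number", "name"]

def probe_schedule(start, hours=10, per_hour=10, lookups=LOOKUPS):
    """Return (timestamp, lookup) pairs spaced evenly across the test window."""
    interval = timedelta(hours=1) / per_hour  # 10 per hour -> one every 6 minutes
    methods = cycle(lookups)                  # rotate through the lookup methods
    return [(start + i * interval, next(methods))
            for i in range(hours * per_hour)]

# Example: a window that opens at 07:00 (an assumed start time).
schedule = probe_schedule(datetime(2024, 1, 8, 7, 0))
print(len(schedule), schedule[0], schedule[-1])
```

At ten transactions per hour, each simulator issues one query every six minutes, which is why the test could run all day without affecting real users.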
We had our test scenario start an hour before the employees came online and then run for an hour after the shift ended. We also collected server-side metrics in the performance tool, which let us see spikes on the server at a finer granularity than the hourly summary available to the admin team.
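To see why an hourly summary can look fine while users are suffering, consider this small sketch. The CPU readings are invented for illustration (they are not our measurements), and `hourly_vs_peak` is a hypothetical helper.

```python
from statistics import mean

def hourly_vs_peak(samples):
    """Return (average, peak) for a list of per-minute readings."""
    return mean(samples), max(samples)

# Illustrative values only: 55 quiet minutes at 20% CPU
# followed by a 5-minute spike at 95%.
samples = [20] * 55 + [95] * 5
avg, peak = hourly_vs_peak(samples)
print(f"hourly average: {avg:.2f}%  peak minute: {peak}%")
```

With these made-up numbers the hourly average stays near 26 percent even though the server spent five minutes close to saturation, which is the kind of detail per-minute collection exposes.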
We found that how a customer record was queried had a dramatic impact on the time needed to return the customer information. In our case, the search by customer phone number (the number most of us know) was taking two to five times as long to return results as the other lookup methods. We made a code change to better handle search by phone number (index) and were able to get this lookup method to perform at the same level as the other methods.
We also found that nine hours into the test (close to quitting time), the servers became overloaded and performance turned erratic. Because we knew the exact time when the load was greatest, we could check the logs, and we found that users were trying to run some end-of-day reports before the day was completed. This was causing excessive record locking. We automated those reports and scheduled them to run after the office closed, so the users no longer had to run them manually. This change resolved the performance issues seen by both the users and managers.
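A comparison like this is easy to reproduce from exported timings. The sketch below is a minimal example, assuming the results are available as (lookup method, seconds) pairs; the method names and timings are made up, and `slowdown_by_method` is a hypothetical helper rather than a feature of any particular tool.

```python
from collections import defaultdict
from statistics import median

def slowdown_by_method(timings):
    """Return {method: median time / fastest method's median time}."""
    by_method = defaultdict(list)
    for method, seconds in timings:
        by_method[method].append(seconds)
    medians = {m: median(v) for m, v in by_method.items()}
    baseline = min(medians.values())  # fastest lookup method is the reference
    return {m: round(med / baseline, 2) for m, med in medians.items()}

# Illustrative timings in seconds, not real measurements.
timings = [
    ("account_number", 1.0), ("account_number", 1.2), ("account_number", 1.1),
    ("phone", 3.3), ("phone", 2.2), ("phone", 5.5),
]
ratios = slowdown_by_method(timings)
print(ratios)
```

Using the median rather than the mean keeps one unusually slow query from distorting the comparison between methods.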
Issue #2: How much slower is an application depending on where the user is located?
New Mexico is more than 121,000 square miles of wonderful country, and as the major electric provider, my company has locations across the state. Some of the remote locations do not have the highest speed connections possible, either because they are not available in the local community or the cost is extreme. While it is possible to calculate response time if we consider factors like latency, distance, and LAN speed, a chart showing the actual performance from different locations is valuable and reflects some items that can’t be easily calculated.
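For reference, the back-of-the-envelope calculation mentioned above can be sketched as follows. It is a deliberately simplified model, assuming response time is just round-trip delays plus transfer time; the function name, parameters, and sample values are illustrative assumptions, and the model ignores congestion, retransmissions, and server processing time.

```python
def estimated_response_seconds(payload_bytes, bandwidth_bps, rtt_seconds,
                               round_trips=1):
    """Rough estimate: network round trips plus time to move the payload."""
    transfer_seconds = payload_bytes * 8 / bandwidth_bps  # bytes -> bits
    return round_trips * rtt_seconds + transfer_seconds

# Illustrative values: a 1 MB response over an 8 Mbps link,
# with a 50 ms round-trip time and two round trips.
estimate = estimated_response_seconds(1_000_000, 8_000_000, 0.05, round_trips=2)
print(f"estimated response: {estimate:.2f} s")
```

The gap between an estimate like this and what the load tool actually measures from each site is exactly the information the calculation cannot provide.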
By using an automated performance tool, you can run this kind of test during the workday and also in off hours, when congestion may be lower. This is very useful information when you have a dispersed workforce whose members all have to use a central solution to do their required functions.
[Table: response in seconds for each test step measured, including the low-bandwidth location]
The location with the worst performance was able to get the capital funding to improve its network speed even though it is a smaller location.
Issue #3: Performance issue in a remote location is not seen in other localities
Response time for an application used globally became unacceptable during the business day in India but not in other locations.
As part of a global rollout of a new defect management tool, we were consolidating users of six different defect-tracking tools onto a single solution with the servers located in Ohio. During the performance testing of the application, we saw an unexpected performance issue for our team in India. For these users, we saw network latency grow dramatically during the day, to the point where the application would generate errors due to extreme packet loss.
Our investigation first focused on the application being implemented, but we determined the issue was a question of network load and limited bandwidth on the WAN. Once we understood where the processing delay was coming from, we were able to drill into the activities present on the LAN during the period in question. We found the local network at the office was being overloaded with nonbusiness-related traffic, such as videos and music. This was the root cause of the performance issue.
Our fix was to change the policy to limit the amount of nonwork-related material that could be streamed on the company LAN.
The cost was small: all that was required was a policy change on the existing proxy network setting. This simple change greatly improved the performance consistency of all LAN/WAN-driven applications, as seen below in this chart of average network latency.
Some Parting Advice
Make sure your scenario does not create real records in production or excessive load on a production system during the business day. Instead, test your script in QA or in off hours. Finally, document your results using terms and values your customer can understand, and communicate this data in a way that makes sense to them.
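One cheap safeguard for the first point is to check the script's statements before it ever runs against production. The sketch below assumes the script issues SQL text, which may not match your tool; the table and column names are hypothetical. It is a naive first-word check, so treat it as a safety net alongside a read-only database account, not a replacement for one.

```python
def assert_read_only(statements):
    """Raise ValueError if any statement does not start with SELECT."""
    for statement in statements:
        words = statement.strip().split(None, 1)
        if not words or words[0].lower() != "select":
            raise ValueError(f"not a read-only query: {statement!r}")
    return True

# Hypothetical query for illustration.
assert_read_only(["SELECT name FROM customers WHERE phone = ?"])
```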
Your load performance testing tool can do more than just load test in development. It can be a great tool for helping the production support teams better manage their networks and users.