Load testing takes a lot of time and effort to set up correctly. There are many factors to plan for and implement, and some of them can be quite subtle.
There’s a tremendous amount of value in getting even a rudimentary load testing scenario up and running quickly, because immediate feedback on the state of your system can reap great rewards. In many cases, though, it’s extremely important to move quickly beyond that initial basic effort. Simplistic scenarios can present an incomplete or, worse, outright misleading picture of how your system behaves under stress.
Taking the time and effort to move beyond basic load testing is a crucial step to ensure that stakeholders and customers receive accurate metrics in order to make correct decisions.
One of the most critical and most often overlooked aspects of getting an accurate picture of your system’s performance is the data you use when testing it. This is unfortunate, because the size, trend, and “shape” of the data have a tremendous impact on so many things across your entire application: UI rendering, processing in the business layer, and, of course, the data tier. (Note: “Shape of data” is a phrase I’ve come up with to describe things like trends and distribution of data. For example, a social networking platform may be extremely heavy in blog usage but have little wiki or media content.)
If you’re working with a business-intelligence or data-analytics system, it’s simply unrealistic to have only a clean or empty data set when checking your load. I’ll actually go a step further and say it’s downright irresponsible. You need your scenarios to validate against the realistic demands of processing months or years of data. Similarly, if you’re trying to profile an e-commerce site, you want a realistic set of products, reviews, customer records, etc., to comprise your working data set.
Getting ahold of or creating your data sets is an important task for which you’ve got to plan and dedicate time. Are you going to create your data, or are you going to use a real-world data set? Both options bring their own sets of challenges and constraints.
Real-World Data, Real-World Headaches
If you’re lucky, you may be able to get ahold of a real-world data set. There’s nothing better than using data that represents exactly how the system is used in the wild! In the past, I’ve reached out to customers with whom I’ve had great relationships. Using a “live” data set from a customer often means coming up with some scripts to sanitize the data. You want to ensure that you’re respecting and protecting the customers’ sensitive data, and sometimes this data may have potential legal liabilities attached to it.
If you’re sanitizing a real data set, you’ll need to ensure that you’re not changing the trend, size, or shape of the data. If you’re trying to eliminate potentially sensitive discussion threads from a company forum or mailing list, you’ll want to replace the discussions with text that’s similar in word count. The same principle applies to other types of data, like documents and media files: replace them with stand-ins of similar size.
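As a rough illustration of replacing sensitive text while keeping its word count, here is a minimal Python sketch. The `sanitize_text` function, the `FILLER_WORDS` list, and the sample sentence are all hypothetical names and data made up for this example. A real script would likely also need to preserve things like character length or formatting, depending on what your system is sensitive to.

```python
import random

# Hypothetical filler vocabulary; any harmless word list would do.
FILLER_WORDS = ["lorem", "ipsum", "dolor", "sit", "amet",
                "consectetur", "adipiscing", "elit"]


def sanitize_text(text: str, seed: int = 0) -> str:
    """Replace every word with filler while keeping the word count."""
    rng = random.Random(seed)  # seeded so repeated runs give the same output
    word_count = len(text.split())
    return " ".join(rng.choice(FILLER_WORDS) for _ in range(word_count))


# Made-up sample of a sensitive forum post.
original = "Q3 revenue missed the forecast, so the reorg starts Monday."
sanitized = sanitize_text(original)

# The two counts should match, so the "size" of the post is unchanged.
print(len(original.split()), len(sanitized.split()))
```

Seeding the random generator is a design choice that keeps the sanitized data set reproducible, which makes it easier to compare load test runs over time.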
You’ll also want to avoid changing the dates attached to data-creation events, because doing so can significantly change how your system behaves. For example, a trend-analysis routine might run blisteringly fast when every record shares a single date but fall apart when pointed at data distributed across several years.
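One way to guard against accidentally flattening dates is to compare the date distribution before and after sanitizing. The sketch below is only an illustration: the record layout (`created_at`, `body`), the helper names, and the sample records are all assumptions invented for this example, not part of any particular system.

```python
from collections import Counter
from datetime import date


def records_per_year(records):
    """Count records by the year of their creation date."""
    return Counter(record["created_at"].year for record in records)


def sanitize_record(record):
    """Replace the body word for word; leave created_at untouched."""
    placeholder = " ".join("redacted" for _ in record["body"].split())
    return {**record, "body": placeholder}


# Made-up sample records spread across several years.
original_records = [
    {"created_at": date(2019, 3, 14), "body": "Budget thread for the rollout"},
    {"created_at": date(2021, 7, 2), "body": "Notes from the escalation call"},
    {"created_at": date(2021, 11, 30), "body": "Holiday coverage schedule"},
]
sanitized_records = [sanitize_record(r) for r in original_records]

# Fail loudly if sanitizing changed how records are spread over time.
if records_per_year(original_records) != records_per_year(sanitized_records):
    raise ValueError("Sanitizing changed the date distribution")

print(sorted(records_per_year(sanitized_records).items()))
```

A per-year count is a coarse check. If your trend analysis works at a finer grain, the same comparison could be done per month or per day.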