Synthesize Your Test Data


In a society growing ever conscious of the benefits of organic materials, "synthetic" is a dirty word. But in this week's column, Danny R. Faught argues that when you're designing tests, synthetic data is the way to go. Read on to learn why it's important to be the master of your data.

I'm going to let you in on a secret. When I interview someone for a software testing job, I have one weed-out question that lets me know quickly whether someone understands the basics of testing:

"Tell me how you would test the operating system's feature that lists files on the hard drive. Choose the utility you're most familiar with to test, such as Finder, Explorer, ls, or dir."

Unfortunately, the difference between a good answer and one that leads me to send a résumé to the shredder is not an answer that you're likely to see taught explicitly in a book or a course. Here's the beginning of an all-too-frequently-heard bad answer: "Well, I would look around on the disk to see what files I can use to test with." Can you see what's wrong with that answer?

Good testers know that they need to control their test data; they don't limit themselves to what happens to be lying around on the disk. They use disciplined test design techniques, which usually means that they create the test data for their tests. They never would skip tests simply because they can't find the right data. Sometimes the only oracle that can tell if the tests pass is knowledge of how the data was created.
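A minimal sketch in Python of what controlling the data could look like for the file-listing question. The file names and the helper names build_listing_fixture and check_listing are illustrative assumptions, and os.listdir stands in for whatever listing utility is actually under test. The test can pass or fail only because the expected result comes from how the data was created.

```python
import os
import tempfile


def build_listing_fixture(root):
    """Create a known set of entries under root and return the expected names."""
    # Hypothetical names chosen to cover hidden files, spaces, case, and empty files.
    file_names = ["a.txt", ".hidden", "name with spaces.txt", "UPPER.TXT", "empty"]
    for name in file_names:
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            # "empty" stays at zero bytes; every other file gets one line.
            if name != "empty":
                handle.write("synthetic test data\n")
    os.mkdir(os.path.join(root, "subdir"))
    return sorted(file_names + ["subdir"])


def check_listing():
    """Build the fixture in a scratch directory and compare it with the listing."""
    with tempfile.TemporaryDirectory() as root:
        expected = build_listing_fixture(root)
        actual = sorted(os.listdir(root))  # stand-in for the utility under test
        return actual == expected, expected
```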

Check Your Timing
The kind of test data I'm talking about here is the input that's put in place before starting a test, as a part of the test setup. This may be the contents of a text field that a Web browser automatically populates with previously given information, a file that's loaded into a word processor, or the contents of a database. When you design tests that need to have test data set up in advance, you need to define the procedure for putting it there, using one of these approaches:

  • On the fly: You create the data right before running each test or suite of tests, every time you run them. You need to investigate how long it takes to set up the data, how hard the setup procedure is to automate, and how feasible it is to set up and tear down the data repeatedly. For a word processing file this could be as simple as copying the file, but for a database the task may be complex.
  • Created in advance: You may set up the test data once and leave it in place for days or months at a time, especially if it's difficult to set up. In this case, consider how you regularly will check that the data is still valid. What do you do if a test modifies the data in a way that will cause other tests to fail?

In both scenarios, think about any problems that might arise if two people are using the same data source.
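For data that is created in advance and kept in a file, one possible way to check that it is still valid is to record a fingerprint when the data is built and compare it before each test run. This is only a sketch: the function names are hypothetical, and a database would need a different check, such as row counts or checksums over key tables.

```python
import hashlib


def fingerprint(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def data_is_unchanged(path, recorded_fingerprint):
    """True if the file still matches the fingerprint recorded at setup time."""
    return fingerprint(path) == recorded_fingerprint
```

If the check fails, some test (or some person sharing the data) has modified the file, and the data should be rebuilt before the results of later tests are trusted.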

The "created in advance" option is the most common because it's easier, especially if you're not validating the test data. But "on the fly" tends to be much more robust if you can find a feasible way to do it.
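For the simple case of a file, the "on the fly" approach might be sketched as a setup and teardown wrapper. The name fresh_copy is an assumption made for illustration. Each call works in its own scratch directory, so two people running tests at the same time do not share a copy.

```python
import contextlib
import pathlib
import shutil
import tempfile


@contextlib.contextmanager
def fresh_copy(template_path):
    """Yield a private copy of template_path and delete it afterward."""
    # mkdtemp creates a uniquely named directory for this run only.
    workdir = pathlib.Path(tempfile.mkdtemp(prefix="testdata-"))
    try:
        copy_path = workdir / pathlib.Path(template_path).name
        shutil.copyfile(template_path, copy_path)
        yield copy_path
    finally:
        # Teardown runs even if the test raises, so no stale data is left behind.
        shutil.rmtree(workdir, ignore_errors=True)
```

A test would wrap its body in "with fresh_copy(template) as path:" and is then free to modify the copy, since the template itself is never touched.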

Test Data Mechanics
You may have more than one way of building test data, either through the application's user interface or by directly creating a file or database. If you work through the user interface, entering data as the user would, you have the advantage of getting additional test coverage while you're building the data. However, this process may be so time-consuming that it's not practical for building a large volume of data. Also, some kinds of relationships among the data may be impractical to set up this way, such as a series of transaction dates spread out over a year.
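The transaction-date example is the kind of data that is easy to write directly into a database. The following sketch uses Python's built-in sqlite3 module with an in-memory database. The table layout, column names, and amounts are invented for illustration and do not describe any particular application.

```python
import datetime
import sqlite3


def build_transactions(conn, start, count, days_apart):
    """Insert count transactions, days_apart days from each other, from start."""
    # Hypothetical schema; a real application's tables will differ.
    conn.execute(
        "CREATE TABLE transactions ("
        "id INTEGER PRIMARY KEY, posted_on TEXT, amount_cents INTEGER)"
    )
    rows = [
        ((start + datetime.timedelta(days=i * days_apart)).isoformat(), 1000 + i)
        for i in range(count)
    ]
    conn.executemany(
        "INSERT INTO transactions (posted_on, amount_cents) VALUES (?, ?)", rows
    )
    conn.commit()
    return rows


conn = sqlite3.connect(":memory:")
rows = build_transactions(conn, datetime.date(2023, 1, 1), count=12, days_apart=30)
```

Entering the same twelve transactions through a user interface would mean changing the system clock or waiting most of a year, which is why bypassing the application is attractive here.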

Bypassing the application when you build your data gives you tremendous power for synthesizing a broad range of test scenarios. But you also might have tremendous difficulty figuring out how to piece together a valid data set or, in the case of a negative test, a data set with only the error that you intended to inject. There might not be any documentation on the internal format of the data that you're working with. Also, if you report a bug using synthetic data and you don't know how to reproduce it using data created through the application, you might have trouble getting anyone to pay attention to the bug, especially if robustness isn't an important attribute of the application.
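One way to keep a negative test down to the single error you intended is to start from a data set known to be valid and change exactly one field. The sketch below assumes records are simple dictionaries, and the field names are made up. The differences helper confirms that nothing else changed.

```python
import copy


def valid_records(count):
    """Return count records that are assumed to be valid for the application."""
    return [{"id": i, "quantity": 1, "currency": "USD"} for i in range(1, count + 1)]


def inject_fault(records, index, field, bad_value):
    """Return a copy of records with exactly one field set to bad_value."""
    mutated = copy.deepcopy(records)  # leave the valid baseline untouched
    mutated[index][field] = bad_value
    return mutated


def differences(before, after):
    """List (record index, field name) for every field that differs."""
    return [
        (i, key)
        for i, (old, new) in enumerate(zip(before, after))
        for key in old
        if old[key] != new[key]
    ]


baseline = valid_records(5)
negative_case = inject_fault(baseline, 2, "quantity", -1)
```

Keeping the valid baseline next to the mutated copy also helps when reporting a bug, because the report can state precisely which value was changed.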

Often a hybrid approach is best. When you're working with a database, you can start with a snapshot of production data, if available, and then layer your test data on top of it. Be prepared to insert your test data each time you get a new copy of the production data. If you have any specific requirements on the contents of the database, build them into your test data; don't assume that the production data will satisfy your needs. If you're creating the data in a file with a specific format, you first can create a valid file from within the application and then edit the contents of the file directly to suit your needs.
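The database half of the hybrid approach could be sketched as a layering step that is safe to rerun on every new snapshot. Everything named here is an assumption: the customers table, the reserved ID range starting at 900001, and make_snapshot, which only simulates a production copy with an in-memory SQLite database.

```python
import sqlite3

# Hypothetical test rows in an ID range assumed to be unused by production data.
TEST_ROWS = [
    (900001, "Test Customer Zero Balance", 0),
    (900002, "Test Customer Overdrawn", -5000),
]


def make_snapshot():
    """Stand-in for a fresh copy of production data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, name TEXT, balance_cents INTEGER)"
    )
    conn.execute("INSERT INTO customers VALUES (1, 'Existing Customer', 12345)")
    conn.commit()
    return conn


def layer_test_data(conn):
    """Insert the test rows, replacing any earlier copies of them."""
    conn.executemany(
        "INSERT OR REPLACE INTO customers (id, name, balance_cents) VALUES (?, ?, ?)",
        TEST_ROWS,
    )
    conn.commit()


conn = make_snapshot()
layer_test_data(conn)
layer_test_data(conn)  # running it again must not duplicate rows
```

Because the overdrawn customer is inserted explicitly, a test that needs a negative balance does not depend on the snapshot happening to contain one.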

The Challenges Are Real
You are likely to encounter some frustrating challenges when you try to synthesize test data. For example, if you want to create or modify a test database, you may find that only database administrators (DBAs) are allowed to do this, and you may have difficulty convincing an overworked DBA to use his powers to help you. Even if you do have access to the database resources you need, you may not have the skills required to get the job done. Whatever the challenges, they are rarely the result of one bad management decision, but rather a complex web of limitations ranging from poor testability of a legacy application to a lack of human or machine resources.

The sad reality is that some test teams have given up on creating the data that would enable them to implement adequate test designs. Instead, the habit of testing only with production data has become thoroughly ingrained into the culture.

I'll leave you with a few ideas to help you with your quest in building solid test data.

  • Master several test design techniques so you know what kind of test data you need and can explain clearly why you need it.
  • Use a reliable setup procedure for the data so you reduce headaches caused by corrupted test data.
  • Inform your management of the tests that you want to run but can't run because of limited control over the test data. This will help managers determine how to allocate resources to manage the risk of inadequate testing.
  • Try to gain small victories toward removing roadblocks that make it difficult to synthesize test data. Start with a small step that's not difficult to achieve, and be patient as you continue to improve your organization's test design capabilities.
  • Learn the technical skills that will enable you to synthesize test data, such as: SQL and database administration, how to use test data generator tools, how to program so you can build your own data generators, and how to navigate your operating system so you can find the best way to access the test data.
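As a small illustration of building your own data generator, here is a sketch that produces names for test records. The function name and the choice of boundary cases are assumptions. The fixed seed matters: the same seed produces the same data on every run, so a failure can be reproduced.

```python
import random
import string


def generate_names(seed, count, max_length=12):
    """Return count names: two boundary cases first, then seeded random ones."""
    rng = random.Random(seed)  # private generator, so the data is repeatable
    names = ["", "a" * max_length]  # the empty name and a maximum-length name
    while len(names) < count:
        length = rng.randint(1, max_length)
        names.append(
            "".join(rng.choice(string.ascii_letters) for _ in range(length))
        )
    return names[:count]
```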

Going from an abstract test design technique to a suite of tests that you actually can run against your application takes creativity and determination. Your software's well-being is at stake, so don't shy away from the challenge.

StickyMinds is a TechWell community.
