7 Steps to Improving Your Data Testing

When you have tens of thousands of rows of data, how do you know what to test or how much to test? A set percentage? Random test cases? When do you stop testing? It can be overwhelming. Here are seven steps to help your team streamline their data testing efforts and know what to test, how much to test, and when to stop testing.

You are migrating to a new software application and need to extract your data. How do you know if you extracted the right thing? How do you know that the information is correct?

Well, that is what testing is for, right? But when you have tens of thousands of rows of data, how do you know what to test or how much to test? A set percentage? Random test cases? It can be overwhelming.

When do you stop testing? This is a hard decision for experienced testers, so what about the inexperienced testers, or those who don’t consider themselves testers? Worse, what if your project is falling behind but the data being sent is continually wrong? You can’t cut testing, and you are getting pressure to deliver high-quality results with less and less time.

I found myself in a position to help a project that was in this situation. The project had fallen twenty-five days behind and had a fixed date of deployment. They had smart developers and smart business analysts, but they were struggling. There was just too much data from a system that wasn’t familiar to any of them.

The business analyst who was assigned to the project was new to the system and to the company. Realizing the project was floundering, the manager added another business analyst to help with testing, but she was told to just start testing and wasn’t given time to do any of the analysis that made her great at her job. There were a number of things that were going wrong but no time to fix anything. 

With no test plan and a number of very large files to test, it was hard for them to know when a file had been tested “enough.” Overtime was becoming required, the project was going horribly over budget, and the vendor with whom they were working was starting to lose his temper.

That is when I had the privilege of coming in. My job was to help streamline their testing efforts. I looked at the process, made a test plan, and suggested a few simple changes.

Here are seven steps to help you do the same thing for your project. This process lets the team know what to test, how much to test, and when to stop testing.

1. Make sure everyone is looking at the same requirements.

This seems like a no-brainer, but especially in an agile environment, it is really easy to get busy and start recording changes only in emails or not at all. Many times you only have an original template and a handful of emails that document what is expected. In the project I described above, every person ended up with a different copy of the template, each carrying that person’s own notes, and a few files had no template at all, with only verbal and emailed notes available.

It is vital to have the whole team using the same template (or requirement document, JIRA ticket, or whatever flavor your company uses) to code and test against. The template needs to be continually updated with whatever is decided as you go.

Even in a waterfall environment, there are still notes to maintain as things are coded and bugs are found. It is important to have everyone involved looking at this document, from the business to the programmer, the tester, and even the vendor, if one is involved. Not every company is set up to allow this to happen, so if yours isn’t, be aware that you will have problems because of it. Allow for more time for bugs and be prepared for misunderstandings about what to expect. You can work around not having a set requirements document, but it will take some extra effort.

2. Write a test plan.

With data testing, it is tempting to skip writing a test plan because the work seems simple and the test can end up short and basic. It may feel like a waste of time, but it is not. A well-written test plan gives the testers clear boundaries on what to test and how much to test so they know when it is safe (or at least safe enough) to mark as passed.

We had two testers who were repeating each other’s work because no one had written down what the plan was. Even when there is only one tester, a test plan lets the programmers know what is being looked at. They can point out that something doesn’t make sense to test, or that something important is being skipped, before testing starts instead of after something goes wrong in production. The plan also helps communicate what to expect from testing to the business, management, test leads, and everyone else involved.

When a file of 50,000 lines of data shows up in your mailbox to be tested and the results of the testing are needed yesterday, that’s not when you want to decide what, where, and how much to test. That’s when you’ll appreciate already having a detailed test plan.

3. Decide what to test.

Do you care if the data is valid? Perhaps you only care if the data format is correct. Or maybe you need to know both. This is where to decide that.

In our case, we realized that we mostly cared that the data format was correct, as we were converting between different system formats, but it also needed to match what was in the test environment in order to ensure that the correct system was being utilized. You also should note where valid formats are recorded (hopefully in your living requirements document!) to make it simple to execute the test scenarios.
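As a minimal sketch of what a format-only check might look like, the snippet below validates each field against a pattern. The field names and patterns are hypothetical stand-ins; the real rules would come from your living requirements document.

```python
import re

# Hypothetical format rules for illustration only; the real ones belong in
# the shared requirements document.
FORMAT_RULES = {
    "employee_id": re.compile(r"\d{6}"),
    "company_code": re.compile(r"[A-Z]{3}"),
    "hire_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def format_errors(row):
    """Return the names of fields whose values do not match the expected format."""
    return [
        field
        for field, pattern in FORMAT_RULES.items()
        if not pattern.fullmatch(row.get(field, ""))
    ]


good_row = {"employee_id": "004217", "company_code": "ABC", "hire_date": "2015-03-02"}
bad_row = {"employee_id": "4217", "company_code": "ABC", "hire_date": "03/02/2015"}

print(format_errors(good_row))
print(format_errors(bad_row))
```

Note that this only checks the shape of each value. Whether the values match what is in the test environment is a separate comparison.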

4. Decide where to test.

Keep track of where the data is coming from. Are you developing and testing against a snapshot of the environment? What do you need to compare your data file against to verify that the data is valid? Record the answer in the test plan; for example, note that the data values must be compared against the dev.com site, not the test.com site.

When you refresh your snapshot, what test cases might need to be replaced? It is a good idea to include SQL statements with the test plan to get new test cases if you know that something is likely to change. For example, in a payroll project, one of the test cases might be someone who is on a leave of absence, and that person may be back at work by the time you refresh. Providing a clear way of finding a replacement test case becomes very important and will save you time and stress later.
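Here is one way a stored query for finding replacement test cases could look. This is only a sketch: the `employees` table, its columns, and the `LOA` status value are assumptions made for illustration, and an in-memory SQLite database stands in for the refreshed snapshot.

```python
import sqlite3

# Hypothetical query kept alongside the test plan: find candidates for the
# "leave of absence" test case after a snapshot refresh.
FIND_LEAVE_OF_ABSENCE = """
    SELECT employee_id
    FROM employees
    WHERE payroll_status = 'LOA'
    ORDER BY employee_id
    LIMIT 5
"""


def find_replacements(conn):
    """Return up to five employee IDs currently on a leave of absence."""
    return [row[0] for row in conn.execute(FIND_LEAVE_OF_ABSENCE)]


# In-memory stand-in for the refreshed snapshot (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id TEXT, payroll_status TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("1001", "ACTIVE"), ("1002", "LOA"), ("1003", "LOA")],
)

candidates = find_replacements(conn)
print(candidates)
```

If the original test case is no longer on leave after the refresh, the tester can run the query and pick a new one without stopping to work out where to look.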

5. Decide how much to test.

What test cases are needed to ensure that the highest-risk areas are addressed? This is often the hardest part to decide. There’s no way to test everything, and trying to do so is costly and time-consuming.

Do you test a certain percentage? I know a tester who is fond of testing 10 percent of a file, but that becomes highly impractical as the file grows: 10 percent of a 50,000-line file is still 5,000 rows.

I have found that determining what variety exists in the data and finding representative test cases gives good coverage of what could happen with the data, as well as a clear stopping point. Our data involved the payroll system, which had three different company codes, four benefit types, eight payroll status types, two retirement options, and a variety of other fields that are string or number values and vary by person. Thankfully, everything after retirement options was either a value such as a name or salary or was tied to the company, benefit type, status, or retirement option. So, if we covered test cases based on those four areas, the rest of the data should be tested fairly well.

However, if we chose a test case for every combination, even ignoring everything after retirement options, we would end up with 192 test cases (3 × 4 × 8 × 2). That is too many to be practical! Instead, we realized that four of the payroll status types are rare, so we decided we needed only one test case for each of them and did not worry that those cases shared the same company code. We did want representation for each of the company codes, and the benefit types were where the majority of our data varied, so we picked one test case per benefit type per company, which added twelve. Now we had sixteen test cases. We really needed only a few of the retirement options mixed in, so we made sure that the test cases we picked had both options represented. Payroll statuses were mostly represented already, but we had missed one type, so we picked a person from the company with the most people and with the most common benefit type. This brought us to seventeen total test cases, which is completely manageable.
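The arithmetic behind those numbers can be checked with a short sketch. The counts come from the project described above; the labels such as `C1` or `S1`, and which four statuses are treated as rare, are placeholders invented for illustration.

```python
from itertools import product

# Counts are from the payroll example; the labels are placeholders.
companies = ["C1", "C2", "C3"]
benefits = ["B1", "B2", "B3", "B4"]
statuses = [f"S{i}" for i in range(1, 9)]
retirement_options = ["R1", "R2"]

# Every possible combination of the four driving fields.
full_combinations = len(list(product(companies, benefits, statuses, retirement_options)))

# Reduced plan: one case per rare status, one per company/benefit pair,
# plus one extra case for the status that was still missing.
rare_statuses = statuses[:4]  # assumption: which four are rare
company_benefit_pairs = list(product(companies, benefits))
reduced_total = len(rare_statuses) + len(company_benefit_pairs) + 1

print(full_combinations, len(company_benefit_pairs), reduced_total)
```

The retirement options add no test cases of their own in the reduced plan, because both options are covered by choosing the existing cases carefully.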

Now, as we are testing, if we notice someone whose information is “weird” or if the business mentions they always have trouble with a few types of data, we add them to the list. This way testing does not fall behind, and we still end up with a reasonable, short list.

6. Keep an eye out for odd things.

One of the hardest things to teach someone is that part of testing is looking for things that don’t belong or just don’t look right. If you find something odd, track it down. For example, you might notice that if a last name has a space or a special character, the data extract splits the row into two rows or drops half the data.

It is wise to timebox this activity, though. It can be easy to spend an hour tracking down something that turns out to be nothing. Give yourself five to fifteen minutes to track it down and then report it. If you still aren’t sure what you are looking at, ask someone to look at it.
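Some oddities are cheap to screen for automatically. As a rough sketch, assuming the extract is a comma-delimited file with a header row (the sample data below is made up), this flags rows whose field count differs from the header, which is how a row split in two or missing half its data tends to show up.

```python
import csv
import io


def rows_with_wrong_field_count(text, delimiter=","):
    """Return (row_number, fields) for rows whose field count differs from the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    return [
        (row_number, row)
        for row_number, row in enumerate(reader, start=2)  # header is row 1
        if len(row) != expected
    ]


# Made-up extract in which a last name containing a space was split across two rows.
sample = (
    "employee_id,last_name,status\n"
    "1001,Smith,ACTIVE\n"
    "1002,De\n"
    "Luca,ACTIVE\n"
    "1003,O'Brien,LOA\n"
)

suspicious = rows_with_wrong_field_count(sample)
print(suspicious)
```

A check like this does not replace looking at the data, but it narrows down where to look within the timebox.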

7. Stay focused on what you are trying to test.

This is a hard one to balance with the previous item, but it isn’t about ignoring issues. Rather, keep in mind what the goal of your testing is.

In the case of an extract, usually you aren’t looking for data that was entered incorrectly; you’re checking that what is in the current system is what you are pulling. This means that if you have a person the system says is paying into a 401(k) but is marked as retired, it is not a bug for your project. As long as both the extract and the system say the same thing, the fact that the data should never have been entered into the system does not matter for your results.

In a case like this, write it down and email the business or whoever is responsible for correcting it. However, that is not a failure for your testing purposes.
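To make the distinction concrete, here is a small sketch that compares extract rows to source rows by key and reports only disagreements between the two. The field names and records are hypothetical. The retired person paying into a 401(k) passes because the extract matches the system, even though the data itself looks wrong.

```python
def compare_extract_to_source(extract_rows, source_rows, key="employee_id"):
    """Report rows where the extract disagrees with, or is missing from, the source."""
    source_by_key = {row[key]: row for row in source_rows}
    extract_keys = {row[key] for row in extract_rows}
    return {
        # Extract rows that differ from (or do not exist in) the source system.
        "mismatched": [
            row[key] for row in extract_rows if source_by_key.get(row[key]) != row
        ],
        # Source rows that never made it into the extract.
        "missing_from_extract": [k for k in source_by_key if k not in extract_keys],
    }


# Hypothetical records: 1001 is retired yet paying into a 401(k) in BOTH places.
source = [
    {"employee_id": "1001", "status": "RETIRED", "pays_401k": "Y"},
    {"employee_id": "1002", "status": "ACTIVE", "pays_401k": "Y"},
    {"employee_id": "1003", "status": "ACTIVE", "pays_401k": "N"},
]
extract = [
    {"employee_id": "1001", "status": "RETIRED", "pays_401k": "Y"},
    {"employee_id": "1002", "status": "ACTIVE", "pays_401k": "N"},
]

result = compare_extract_to_source(extract, source)
print(result)
```

Only employee 1002 (a value changed in the extract) and employee 1003 (dropped from the extract) count as failures here. Employee 1001 goes in an email to the business instead.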

The tricky thing with testing is that there is a lot of variety in what you might be testing, so there are a good number of exceptions. This list is not complete. It skips a few obvious pieces, like making sure you got the number of rows you expected. However, these seven steps should give you a solid place to start—and let you know where to stop.

User Comments

Marta Dabrowska

More and more software development processes and big data management are run with lean methodologies. However, application and data testing are very often neglected in order to reduce initial costs and time. But I think it's wrong! And your process is very good - we use similar steps to test out data and apps at our work. Thanks! 

November 30, 2017 - 6:11am
Karis Van Valin

Often testing is sacrificed to reduce cost and time regardless of the methodology used and, while it works initially, it is the most effective way to sabotage your product long term. I believe that there is a better option; it has a higher cost initially but a higher payoff in the end product.

November 30, 2017 - 8:56am
Adrian Witas

Besides all these considerations and strategy, it is also critical to automate all the data testing cohesively.

You might consider an open source framework like endly to help with it.

June 27, 2018 - 1:29pm

StickyMinds is a TechWell community.