Testing the Extract, Transform, and Load Process in Data Warehouses

[article]
Summary:
Pulling data from a source system and putting it into a data warehouse is a process commonly known as extract, transform, and load, or ETL. Testing the process can be a chore—you need to be sure all appropriate data is extracted, that it is transformed correctly to match the data warehouse schema, and that it's all imported. Instead of testing the ETL process as a black box, you can pull it apart, testing each piece in isolation.

A few years ago, my company was building something I called a “data bridge,” which essentially pulled data from one system, like transaction processing, into another, like an accounting package. I wanted insight into how to test it.

The people I did not respect said, “Oh, yeah, we can do that for you. We have experts.” The ones I did said something more like, “Hm, that’s interesting. I’ll have to think about that.”

Today I know to call that type of software process extract, transform, and load, commonly known as ETL. I’ve also accumulated a collection of techniques to test it.

In this article I will refer to the source system as what we are extracting from and the target system as where we are putting the results. These systems might be third-party software, internal software, or a data warehouse. A “row” of data is an item to be imported from one system to another: a transaction, a claim, a customer, an account, and so on. For our purposes, it doesn’t matter, but data warehouses do represent a slightly different wrinkle.

The Quick and Easy Way

The way most people test an ETL exchange is to make the test system a copy of production, perhaps with some personal information sanitized. Then they either identify a row of data for every unique condition in the requirements or create that row in the test system. Next, they run the interface and import the data, then check the target system to make sure the data is correct for each of those rows.

If you really want to get fancy, you can perform some operations in the target system and make sure the data behaves appropriately. A clever tester will create “bad” records to see how they are handled, but, otherwise, that is about it. If the scenarios run, call the ETL process done.

However, this approach is overly simplistic. Instead of testing the ETL process as a black box, we can pull it apart, testing each piece in isolation.

Testing Each Piece

The result of a data extract is usually a large text file. These days, it might be an API response or a series of API responses.

It is usually easy enough to write a SELECT COUNT(*) query against the source system to make sure the number of rows returned matches. After that, continue writing queries to check that the data itself matches. If possible, do it twice: write another query, save the results to disk, and find some way to compare the two files to make sure they match. I did this once in a health insurance system with four hundred thousand records, and we found a hidden case for what to fill in a field for about five thousand rows. The programmer and I had interpreted the requirements differently.
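The count check above can be scripted. Here is a minimal sketch, assuming a SQLite source database and hypothetical names (a CLAIMS table and a headered CSV extract); your own database driver and file layout will differ.

```python
import sqlite3

def count_matches(db_path: str, table: str, extract_path: str) -> bool:
    """Compare the source row count to the number of rows extracted."""
    conn = sqlite3.connect(db_path)
    (source_count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    conn.close()
    with open(extract_path) as f:
        extract_count = sum(1 for _ in f) - 1  # subtract the header row
    return source_count == extract_count
```

A mismatch tells you rows were dropped or invented somewhere in the extract; it says nothing about the contents, which is why the follow-up field-by-field queries still matter.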

Another way to test the extract process is to have a standard database. Load up the database in the test system, run the extract, and compare it to a file checked into version control. In some cases we may need to swizzle, null out, or replace dates in the file on disk with the system date, but otherwise, this is pretty straightforward.
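The date-swizzling part is usually the only fiddly bit. A minimal sketch, assuming ISO-formatted dates in a plain-text extract; the file names and the placeholder token are hypothetical:

```python
import re

# Replace any ISO date so run dates never cause a spurious diff.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def normalize(text: str) -> str:
    return DATE.sub("<DATE>", text)

def matches_golden(extract_path: str, golden_path: str) -> bool:
    """Compare today's extract to the 'golden' file in version control."""
    with open(extract_path) as a, open(golden_path) as b:
        return normalize(a.read()) == normalize(b.read())
```

When this check fails, a plain diff of the two normalized files points you straight at the offending rows.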

In the “transform” step, we take plain text and turn it into XML, or change XML into a .CSV file, EDI, or some other format. Testing this part of the process is the same as testing the extract part: Either have a known source system or write it twice.
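"Write it twice" can be as literal as it sounds. A sketch, assuming a made-up transform (pipe-delimited text to CSV) with a deliberately different second implementation; agreement between the two raises confidence, and any disagreement marks a row worth investigating:

```python
import csv
import io

def transform(text: str) -> str:
    """The transform under test, built on the csv module."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        writer.writerow(line.split("|"))
    return out.getvalue()

def reference(text: str) -> str:
    """An independent re-implementation of the same rules."""
    return "".join(",".join(line.split("|")) + "\n"
                   for line in text.splitlines())
```

The two versions will diverge on tricky inputs, such as fields containing commas, which the csv writer quotes and the naive join does not. Those divergences are exactly the interpretation differences you are hunting for.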

Finally, we have the load part of the process. Find a way to check rows before and after the import. If you have a “standard” transformed file to import, you can clear out the database between runs and dump the database to disk, then compare what the database should look like after a run to how it actually looks with a diffing tool.
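Database dumps diff well because they are just text. A sketch, assuming SQLite targets (its `iterdump()` emits the schema and data as SQL statements); with another database you would use its own dump utility, such as pg_dump, instead:

```python
import sqlite3

def dump(db_path: str) -> list[str]:
    """Dump a database as a list of SQL statements (schema plus data)."""
    conn = sqlite3.connect(db_path)
    statements = list(conn.iterdump())
    conn.close()
    return statements

def load_matches(actual_db: str, expected_db: str) -> bool:
    """Compare the target after a load to what it should contain."""
    return dump(actual_db) == dump(expected_db)
```

Feeding the two dumps to a diffing tool, rather than just comparing for equality, shows exactly which rows the load step got wrong.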

Test overwrite, where a row pre-exists but is changed, if a business process allows for it. Explore where the data goes, how it is used, and if the target software provider has any rules about what can and should be updated.

Other Things to Test

One hidden issue with ETL and data warehouses is downtime. The first time we run an extract, it will probably load pretty quickly—but what happens in year two with a daily process?

Two of the systems I have tested recalculated balances on every run. The first slowly grew from a ten-minute run to a three-hour run, and we had to add a lot of memory. The second timed out within two weeks of starting. Not only did the programmer recalculate balances from scratch, he also nested a slow select loop (get all customers) within another (for each customer, get all transactions). Yes, I did try to fail that one on code review, but that’s a different story.

Don’t just clear out the database and retest. Test with days, months, and years of data added to the database, along with a growing customer base that purchases products more often. Model realistic growth of the system, and look at how this impacts uptime.
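Growth modeling can be automated. A sketch, assuming an in-memory SQLite database and an invented TRANSACTIONS table: add a day's worth of rows, time the recalculation query, and repeat, watching for the run time curving upward instead of staying flat.

```python
import random
import sqlite3
import time

def timed_runs(days: int, rows_per_day: int) -> list[float]:
    """Grow the data day by day and time a balance recalculation each day."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TRANSACTIONS (day INTEGER, amount REAL)")
    timings = []
    for day in range(days):
        conn.executemany(
            "INSERT INTO TRANSACTIONS VALUES (?, ?)",
            [(day, random.random()) for _ in range(rows_per_day)],
        )
        start = time.perf_counter()
        conn.execute(
            "SELECT COUNT(*), SUM(amount) FROM TRANSACTIONS"
        ).fetchone()
        timings.append(time.perf_counter() - start)
    return timings
```

A process that recalculates everything from scratch, like the ones described above, shows up here as a timing list that grows linearly (or worse) with the day number.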

Fake Out the Tables

If you need help with ETL testing at all, you are probably dealing with millions or tens of millions of rows. Just running the extract can take hours, only to find an error. That means a single test run can take a day or two. That's a feedback loop that is far too slow.

Some complex databases have the idea of a “schema,” or user, and a hierarchy of users. For example, say you are logged in as TEST_USER, and call:

SELECT * FROM EMPLOYEE;

The system will first look for a table in the TEST_USER schema called EMPLOYEE. If it fails to find that, it will look in the next level up (perhaps OPERATION_USER), then above that (perhaps APPLICATION) and above that (perhaps SYSTEM). Typically the database administrators define these tables.

Here’s the good news. You can create a fake table with a small number of employees in the TEST_USER schema, then run the interface as TEST_USER to get results limited to those employees.

Identify the “driving” table—the table with the most rows—and create a fake table in your local schema with the same name and a much smaller dataset. Instead of ten million rows, you might get a hundred. The interface will run faster, and you can try the strategies above, testing against a known good canned dataset.

Of course, you’ll probably want to try with realistic data for a final check, but the strategy allows you to radically tighten the feedback loop with good enough results, enough of the time.

The article format only allows me to sketch out a few strategies at a high level for testing an ETL process. What are your favorite tricks?

User Comments

Jim Peers

Hi Matthew,

Enjoyed the article. On a recent data warehouse project, I used queries to make sure that there were no non-allowed null values, as well as checking for duplicates that should not have been there. My workhorse query, however, proved to be a source-target string comparison like so:

select T.TargetKey, T.TargetAttribute, S.SourceKey, S.SourceAttribute
from TargetTable T
inner join SourceTable S
  on T.TargetKey = S.SourceKey
where T.TargetAttribute <> S.SourceAttribute

I found errors pretty easily with this one.

June 27, 2017 - 2:48pm
Francois Marais

Hi Matt, good article with useful techniques.

A word in support of the simplistic black box approach: while it is complementary to the piecemeal testing, it has two advantages.

1) It has the potential to reflect user expectations. Business users often assess the data warehouse based on how well it corresponds with the source data (including, of course, agreed-upon business rules), so a test that queries the source data and diff-compares it with the expected result set on the target can help with that. The test is all the more useful if, instead of simple count(*)/group by high-level results, one can use a tool that enables building test cases that assert detailed source data and compare it with expected target data. It is also easier for team members to follow up issues from detailed cases than from aggregated discrepancies.

2) Black box testing, being an end-to-end comparison, automatically catches any material errors that happen anywhere in the ETL process. The piecemeal testing, to be comprehensive, needs to cover every step in the process to reliably detect source-target discrepancies.

July 14, 2017 - 2:47am

StickyMinds is a TechWell community.