Testing the Extract, Transform, and Load Process in Data Warehouses

[article]
Summary:
Pulling data from a source system and putting it into a data warehouse is a process commonly known as extract, transform, and load, or ETL. Testing the process can be a chore—you need to be sure all appropriate data is extracted, that it is transformed correctly to match the data warehouse schema, and that it's all imported. Instead of testing the ETL process as a black box, you can pull it apart, testing each piece in isolation.

A few years ago, my company was building something I called a “data bridge,” which essentially pulled data from one system, like transaction processing, into another, like an accounting package. I wanted insight into how to test it.

The people I did not respect said, “Oh, yeah, we can do that for you. We have experts.” The ones I did said something more like, “Hm, that’s interesting. I’ll have to think about that.”

Today I know to call that type of software process extract, transform, and load, commonly known as ETL. I’ve also accumulated a collection of techniques to test it.

In this article I will refer to the source system as what we are extracting from and the target system as where we are putting the results. These systems might be third-party software, internal software, or a data warehouse. A “row” of data is an item to be imported from one system to another: a transaction, a claim, a customer, an account, and so on. For our purposes, it doesn’t matter, but data warehouses do represent a slightly different wrinkle.

The Quick and Easy Way

The way most people test an ETL exchange is to make the test system a copy of production, perhaps with some personal information sanitized. Then they either identify a row of data for every unique condition in the requirements or create that row in the test system. Next, they run the interface and import the data, then check the target system to make sure the data is correct for each of those rows.

If you really want to get fancy, you can perform some operations in the target system and make sure the data behaves appropriately. A clever tester will create “bad” records to see how they are handled, but, otherwise, that is about it. If the scenarios run, call the ETL process done.

However, this approach is overly simplistic. Instead of testing the ETL process as a black box, we can pull it apart, testing each piece in isolation.

Testing Each Piece

The result of a data extract is usually a large text file. These days, it might be an API response or a series of API responses.

It is usually easy enough to write a SELECT COUNT(*) query against the source system to make sure the number of rows returned matches. After that, continue writing queries to check that the data itself matches. If possible, do it twice: write another query, save the results to disk, and find some way to compare the two files to make sure they match. I did this once in a health insurance system with four hundred thousand records, and we found a hidden case for what to fill in a field for about five thousand rows. The programmer and I had interpreted the requirements differently.
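The count check above can be scripted. Here is a minimal sketch, assuming a SQLite source database and hypothetical names (a CLAIMS table and a headered CSV extract); your own database driver and file layout will differ.

```python
import sqlite3

def count_matches(db_path: str, table: str, extract_path: str) -> bool:
    """Compare the source row count to the number of rows extracted."""
    conn = sqlite3.connect(db_path)
    (source_count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    conn.close()
    with open(extract_path) as f:
        extract_count = sum(1 for _ in f) - 1  # subtract the header row
    return source_count == extract_count
```

A mismatch tells you rows were dropped or invented somewhere in the extract; it says nothing about the contents, which is why the follow-up field-by-field queries still matter.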

Another way to test the extract process is to have a standard database. Load up the database in the test system, run the extract, and compare it to a file checked into version control. In some cases we may need to swizzle, null out, or replace dates in the file on disk with the system date, but otherwise, this is pretty straightforward.
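The date-swizzling part is usually the only fiddly bit. A minimal sketch, assuming ISO-formatted dates in a plain-text extract; the file names and the placeholder token are hypothetical:

```python
import re

# Replace any ISO date so run dates never cause a spurious diff.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def normalize(text: str) -> str:
    return DATE.sub("<DATE>", text)

def matches_golden(extract_path: str, golden_path: str) -> bool:
    """Compare today's extract to the 'golden' file in version control."""
    with open(extract_path) as a, open(golden_path) as b:
        return normalize(a.read()) == normalize(b.read())
```

When this check fails, a plain diff of the two normalized files points you straight at the offending rows.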

In the “transform” step, we take plain text and turn it into XML, or change XML into a .CSV file, EDI, or some other format. Testing this part of the process is the same as testing the extract part: Either have a known source system or write it twice.
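"Write it twice" can be as literal as it sounds. A sketch, assuming a made-up transform (pipe-delimited text to CSV) with a deliberately different second implementation; agreement between the two raises confidence, and any disagreement marks a row worth investigating:

```python
import csv
import io

def transform(text: str) -> str:
    """The transform under test, built on the csv module."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        writer.writerow(line.split("|"))
    return out.getvalue()

def reference(text: str) -> str:
    """An independent re-implementation of the same rules."""
    return "".join(",".join(line.split("|")) + "\n"
                   for line in text.splitlines())
```

The two versions will diverge on tricky inputs, such as fields containing commas, which the csv writer quotes and the naive join does not. Those divergences are exactly the interpretation differences you are hunting for.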

Finally, we have the load part of the process. Find a way to check rows before and after the import. If you have a “standard” transformed file to import, you can clear out the database between runs and dump the database to disk, then compare what the database should look like after a run to how it actually looks with a diffing tool.
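Database dumps diff well because they are just text. A sketch, assuming SQLite targets (its `iterdump()` emits the schema and data as SQL statements); with another database you would use its own dump utility, such as pg_dump, instead:

```python
import sqlite3

def dump(db_path: str) -> list[str]:
    """Dump a database as a list of SQL statements (schema plus data)."""
    conn = sqlite3.connect(db_path)
    statements = list(conn.iterdump())
    conn.close()
    return statements

def load_matches(actual_db: str, expected_db: str) -> bool:
    """Compare the target after a load to what it should contain."""
    return dump(actual_db) == dump(expected_db)
```

Feeding the two dumps to a diffing tool, rather than just comparing for equality, shows exactly which rows the load step got wrong.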

Test overwrite, where a row pre-exists but is changed, if a business process allows for it. Explore where the data goes, how it is used, and if the target software provider has any rules about what can and should be updated.

Other Things to Test

One hidden issue with ETL and data warehouses is downtime. The first time we run an extract, it will probably load pretty quickly—but what happens in year two with a daily process?

Two of the systems I have tested recalculated balances on every run. The first slowly grew from a ten-minute run to a three-hour run, and we had to add a lot of memory. The second timed out within two weeks of starting. Not only did the programmer recalculate balances from scratch, he also nested a slow select loop (get all customers) within another (for each customer, get all transactions). Yes, I did try to fail that one on code review, but that’s a different story.

Don’t just clear out the database and retest. Test with days, months, and years of data added to the database, along with a growing customer base that purchases products more often. Model realistic growth of the system, and look at how this impacts uptime.
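Growth modeling can be automated. A sketch, assuming an in-memory SQLite database and an invented TRANSACTIONS table: add a day's worth of rows, time the recalculation query, and repeat, watching for the run time curving upward instead of staying flat.

```python
import random
import sqlite3
import time

def timed_runs(days: int, rows_per_day: int) -> list[float]:
    """Grow the data day by day and time a balance recalculation each day."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TRANSACTIONS (day INTEGER, amount REAL)")
    timings = []
    for day in range(days):
        conn.executemany(
            "INSERT INTO TRANSACTIONS VALUES (?, ?)",
            [(day, random.random()) for _ in range(rows_per_day)],
        )
        start = time.perf_counter()
        conn.execute(
            "SELECT COUNT(*), SUM(amount) FROM TRANSACTIONS"
        ).fetchone()
        timings.append(time.perf_counter() - start)
    return timings
```

A process that recalculates everything from scratch, like the ones described above, shows up here as a timing list that grows linearly (or worse) with the day number.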

Fake Out the Tables

If you need help with ETL testing at all, you are probably dealing with millions or tens of millions of rows. Just running the extract can take hours, only to find an error. That means a single test run can take a day or two. That's a feedback loop that is far too slow.

Some complex databases have the idea of a “schema,” or user, and a hierarchy of users. For example, say you are logged in as TEST_USER, and call:

SELECT * FROM EMPLOYEE;

The system will first look for a table in the TEST_USER schema called EMPLOYEE. If it fails to find that, it will look in the next level up (perhaps OPERATION_USER), then above that (perhaps APPLICATION) and above that (perhaps SYSTEM). Typically the database administrators define these tables.

Here’s the good news. You can create a fake table with a small number of employees in the TEST_USER schema, then run the interface as TEST_USER to get results limited to those employees.

Identify the “driving” table—the table with the most rows—and create a fake table in your local schema with the same name and a much smaller dataset. Instead of ten million rows, you might get a hundred. The interface will run faster, and you can try the strategies above, testing against a known good canned dataset.

Of course, you’ll probably want to try with realistic data for a final check, but the strategy allows you to radically tighten the feedback loop with good enough results, enough of the time.

The article format only allows me to sketch out a few strategies at a high level for testing an ETL process. What are your favorite tricks?

User Comments

Jim Peers

Hi Matthew,

Enjoyed the article. On a recent data warehouse project, I used queries to make sure that there were no non-allowed null values, as well as checking for duplicates that should not have been there. My workhorse query, however, proved to be a source-target string comparison like so:

select T.TargetKey, T.TargetAttribute, S.SourceKey, S.SourceAttribute
from TargetTable T
inner join SourceTable S
  on T.TargetKey = S.SourceKey
where T.TargetAttribute <> S.SourceAttribute

I found errors pretty easily with this one.

June 27, 2017 - 2:48pm
Francois Marais

Hi Matt, good article with useful techniques.

A word in support of the simplistic black box approach: while it is complementary to the piecemeal testing, it has two advantages.

1) It has the potential to reflect user expectations. Business users often assess the data warehouse based on how well it corresponds with the source data (including, of course, agreed-upon business rules), so a test that queries the source data and diff-compares it with the expected result set on the target can help with that. The test is all the more useful if, instead of simple count(*)/group by high-level results, one can use a tool that enables building test cases that assert detailed source data and compare it with expected target data. It is also easier for team members to follow up issues from detailed cases than from aggregated discrepancies.

2) Black box testing, being an end-to-end comparison, automatically catches any material errors that happen anywhere in the ETL process. The piecemeal testing, to be comprehensive, needs to cover every step in the process to reliably detect source-target discrepancies.

July 14, 2017 - 2:47am

StickyMinds is a TechWell community.