One of the cardinal rules of data analysis is to validate your data before you start the analysis.
I was recently given an important project where my results would provide senior leadership with new insight on existing data. I did my interview with the person requesting the analysis and felt I had a pretty good understanding of what I was supposed to do. I was given the files for analysis, and they seemed to have all the fields I needed. I met with the team who supplied the data and got an understanding of the fields I needed to analyze, and I went to town.
The process I had to do was complex, had multiple translational steps, and was very time-consuming. Finally, after about a hundred hours of effort, I was ready to share my findings, and I felt pretty good going into the meeting.
However, during the meeting it became apparent that my results were only about two-thirds of the expected totals. Either I really goofed it up, or the data I started with was incomplete.
I went back to the raw data I was given and confirmed that while I had the correct record counts, the data I was using in my calculations was incomplete due to filtering prior to its being released to me. I came to learn the data I was given did not include everything I needed due to the original purpose for the data extract. I requested and got access to a summary report that reflected the correct totals I should match to, which I should have asked for first.
I met with the data team to determine what went wrong and how best to acquire the data—all the data—so I could complete my assignment. A couple of days later, I got a new data set to review. This time, I did the total comparisons at the start, and happily, the totals matched. Except as I started to organize the data, I found a new issue.
One of the fields I used in the calculation had been set incorrectly, so instead of values ranging from 1 to 150, they went from 1 to 97,500! The field had been set to sum, not unique. The good news is that I detected the issue before I started the analysis. I alerted the data team to the issue, and we all understood what happened and how it needed to be corrected.
I actually had two data sets: one had 18,000 records, and the other just under 400,000. After the third try for analysis, I checked and the record counts and totals looked correct. However, as I processed the data, I wanted to make it go faster.
After a couple of failed attempts, I now really understood the process. I decided to go with an Excel macro to automate the process. It made sense, as the logic was simple and I already had the data in Excel. I got the process running with the macro, and it worked … but I had a lot of records, and based on the time needed to process a small data set, I determined it would take Excel more than 30 hours to process the data. That did not make good use of the time.
Still, I had a proven macro that we could use as a functional model to move the process into SQL. With the working macro, I also wrote a one-page design document to help my developers understand what my needs really were. This helped me focus and reduced the scope creep that can tend to happen.
I also had a working example with sample data (input) and sample results (output), and my SQL team was able to create the SQL and a database to use in the process. With my working SQL, I could now do additional work that had originally been deemed too hard to include in the work problem.
At the conclusion of the project, I ended up with a simple process that provides a view into our business that was not previously seen, and a number of analysis opportunities have been revealed. These are real opportunities that had not been suspected until this project, and they made a visible difference all the way to our shareholders. I will also be able to use my new, faster process to examine a much larger data set of 18 million records and gain additional insight that was previously not cost effective to collect.
This experience is not about people doing things wrong; it just demonstrates how making small process changes can have a real impact.
Here are some more tips for how you can strive to get the most accurate data analysis possible—the first time.
Validate your data set
Even if you generate your own data extract, you need to validate your data set to ensure you are on stable footing before you start your work. Do both a row count and summaries on key columns. Depending on how you extract data or do calculations, rounding errors can cause your data to not match the source. It is easy to get lost in the weeds when you are processing a file with more than 100,000 records, so have a plan for how to validate your results.
Document your steps
Each time you translate, change, or filter the data, you need to compare what you expected to happen with what actually happened. Grabbing a single row and validating back to the raw source and your data can be a good check.
When you clean the data, save the sweepings
When you filter or remove data that does not meet the criteria you are seeking, save it off and name it for what it is. You may find something you need (or want) to see hidden in those discards. Data outliers can sometimes be the most interesting things.
Detail design specifications
Bring in the pros when you are ready, but have a design spec with test data that is easy to validate. My development team was able to respond quickly because I had taken the time to really document my needs and provide a simple example for design as well as a larger sample with expected results that allowed the script to be verified.
Understand the results
In my actual results I found my numbers were correctly calculated but were not what was expected. This is where data analytics really shows up. I had to determine what could be the factors for the difference. By having worked at the detail level, I was able to see a significant factor not recognized previously. If I had not done the ground-level work, I might not have seen this issue and developed the new information for consideration on a policy change. We accomplished the original project and also gained new insight into our business.