Data Crunching

[article]
Part 1
Summary:

Data crunching is probably the least glamorous part of our jobs, but it has to be done. Someone will always need to recycle legacy code, translate files from one vendor's proprietary binary format into XML, check the integrity of configuration files, or search Web logs to see how many people have downloaded the latest release of the product. Knowing how to compile this data with the least amount of effort can be crucial to a project's success or failure. In this week's column, Greg Wilson looks at some of the existing tools and techniques used to crunch data more efficiently and productively.

It's 9:00 on a Monday morning. You're sitting at your desk savoring that precious first cup of coffee and looking forward to finally finishing that rendering routine when your boss knocks on your door. She says, "I have a little job for you." It seems the product manager was wrong: customers do want to convert their old flat-text input files into XML. Oh, and the three people who actually bought Version 6.1 of the product want to merge parameters from the database as well. Now you've got to take care of it--by the end of the day.

Little data crunching jobs like this come up every day in our business. They aren't glamorous, but knowing how to do them with the least amount of effort can be crucial to a project's success or failure.

Fifteen years ago, most data crunching problems could be handled using classic Unix command line tools, which are designed to process streams of text one line at a time. Today, however, data is more often marked up in some dialect of XML or stored in a relational database. The bad news is that grep, cut, and sed can't handle such data directly. The good news is that newer tools can, and the same data crunching techniques that worked in 1975 can be applied today.

This article looks at what those tools and techniques are, and how they can make you more productive. We start with a simple problem: how to parse a text file.

Extracting Data from Text

The first step in solving any data-crunching problem is to get a fresh cup of coffee. The second is to figure out what your input looks like and what you're supposed to produce from it. In this case, the input consists of parameter files with a .par extension, each of which looks like this:


Each line is a single setting. Its name is at the start of the line and its value or values are inside parentheses (separated by commas if necessary).
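(The sample listing itself is not reproduced here; the fragment below is a hypothetical .par file matching that description, with invented setting names and values.)

```
SCALING_FACTOR(1.5)
BOUNDS(0, 100)
OUTPUT_MODE(verbose)
```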
The output should look like this:
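(The output listing is not reproduced here either; the fragment below is a hypothetical rendering with invented element and attribute names. The reader comments below suggest the actual output contained no line breaks, so it is shown as a single line.)

```
<parameters><parameter name="SCALING_FACTOR"><value>1.5</value></parameter><parameter name="BOUNDS"><value>0</value><value>100</value></parameter></parameters>
```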


Most data-crunching problems can be broken down into three steps: reading the input data, transforming it, and writing the results. wc *.par tells us that the largest input file we have to deal with is only 217 lines long, so the easiest thing to do is read each one into an array of strings for further processing. We'll then parse those lines, transform the data into XML, and write that XML to the output file. In Python, this is:
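(The Python listing is missing from this copy of the article; the sketch below shows the three-step structure just described. The `NAME(value, value)` input format and the XML element names are assumptions based on the surrounding description, not the article's actual code.)

```python
import re
import sys

def read_lines(filename):
    # Input: slurp the whole file into a list of strings.
    with open(filename) as f:
        return f.readlines()

def parse(lines):
    # Processing: turn each "NAME(v1, v2)" line into a (name, values) pair.
    settings = []
    for line in lines:
        line = line.strip()
        m = re.match(r'(\w+)\s*\((.*)\)', line)
        if m is None:
            continue  # skip blank or malformed lines
        name = m.group(1)
        values = [v.strip() for v in m.group(2).split(',')]
        settings.append((name, values))
    return settings

def to_xml(settings):
    # Output: build the whole XML document as one string (no newlines).
    parts = ['<parameters>']
    for name, values in settings:
        body = ''.join(f'<value>{v}</value>' for v in values)
        parts.append(f'<parameter name="{name}">{body}</parameter>')
    parts.append('</parameters>')
    return ''.join(parts)

if __name__ == '__main__' and len(sys.argv) > 1:
    print(to_xml(parse(read_lines(sys.argv[1]))))
```

Because the three stages are separate functions, `parse` can be unit-tested on a handful of literal strings without touching the file system.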


Separating input, processing, and output like this has two benefits: it makes debugging easier and allows us to reuse the input and output code in other situations. In this case the input and output are simple enough that we're not likely to recycle them elsewhere, but it's still a good idea to train yourself to write data crunchers this way. If nothing else, it'll make your code easier for the next person to read.

All right, let's begin by separating the variable name from its parameters, then separate the parameters from each other. Hmm . . . can there ever be spaces between the variable name and the start of the parameter list? grep can tell us:
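(The grep command and its output are missing from this copy; a check along the following lines would answer the question. The sample file and its contents are invented for illustration.)

```shell
# Stand-in .par file, since the article's real data isn't available.
printf 'ALPHA(1.5)\nBETA (2, 3)\n' > sample.par

# Any line with a space before the opening parenthesis?
grep ' (' sample.par
# prints: BETA (2, 3)
```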


Another quick check shows that while parameter values are usually separated by a comma and a space, sometimes there's only a comma.
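(The grep command for this second check is also missing here; a hypothetical version, again with an invented sample file:)

```shell
# Stand-in .par file with both separator styles.
printf 'ALPHA(1, 2)\nBETA(3,4)\n' > sample2.par

# Any comma that is not followed by a space?
grep ',[^ ]' sample2.par
# prints: BETA(3,4)
```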


This sounds like a job for regular expressions, which are the power tools of text processing. Most modern programming languages have a regular expression (RE) library. A few, like Perl and Ruby, have even made it part of the language. A RE

User Comments

Anonymous

Thanks greg, what you have written is absolutely true and full of fact.

Santi Mahapatra, London

February 27, 2006 - 12:32pm
Glenn Halstead

Hi Greg,

Thanks for your article.

Near the end you mention that the xml output has no new lines to make it easy for a person to read and that this is unimportant as it will only be read by machine.

I think that making it readable remains highly important, even if it's never intended to be read by a human (in normal operation). When a problem does occur during development or when in use by a client, the first thing someone will likely do to investigate it is to inspect the xml file. If the file is easy to read then it will be easier, quicker and less error prone to validate it manually.

regards

Glenn Halstead

February 28, 2006 - 7:09am
John Leather

Greg,

Great article, I look forward to part 2! I do have one question, where is Figure 1?

Thanks,

John Leather

March 1, 2006 - 3:19am


About the author

Greg Wilson

Greg Wilson’s book Data Crunching was published by the Pragmatic Bookshelf in April 2005. He received a PhD in computer science from the University of Edinburgh in 1993 and is now a freelance software developer, a contributing editor at Dr. Dobb's Journal, and an adjunct professor in computer science at the University of Toronto.

StickyMinds is one of the growing communities of the TechWell network.