Big Data, Big Trouble: Getting into the Flow of Hadoop Testing

Maryam Umar

Big Data, one of the latest buzzwords in our industry, involves working with petabytes of data captured by various systems and making sense of that data in some way. Maryam Umar has found that testing systems like Hadoop is very challenging because of the frequency with which data arrives in the system, the number of jobs that run to process it, and the interdependencies within the data. Maryam describes some projects that involve identifying multiple users and using that data to make hotel recommendations. Testing this is difficult because it requires a test tool that can represent the jobs being executed in the Hadoop ecosystem. Maryam presents a few examples of how she has overcome this challenge by using the Oozie workflow coordinator as a test tool that works with the Hadoop Distributed File System (HDFS). She demonstrates how test code can be written in a non-testing tool to help gain confidence in the data produced by running a job processor.
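
To give a flavor of what a data check after a job run might look like, here is a minimal Python sketch. It is not taken from the talk. It assumes the job writes its output in the common Hadoop layout of `part-*` files plus a `_SUCCESS` marker, and it uses a local directory as a stand-in for HDFS. The function name `check_job_output` and the `min_records` parameter are hypothetical.

```python
from pathlib import Path


def check_job_output(output_dir, min_records=1):
    """Return a list of problems found in a job's output directory.

    Assumption: the directory follows the usual Hadoop output layout
    (a _SUCCESS marker plus part-* files). A local path stands in for
    HDFS here; this is an illustration, not the speaker's actual code.
    """
    out = Path(output_dir)
    problems = []

    # The job is expected to leave a _SUCCESS marker when it commits.
    if not (out / "_SUCCESS").exists():
        problems.append("missing _SUCCESS marker")

    # The data itself is expected in one or more part-* files.
    part_files = sorted(out.glob("part-*"))
    if not part_files:
        problems.append("no part files")

    # Count non-blank lines as records across all part files.
    records = 0
    for part in part_files:
        with part.open() as fh:
            records += sum(1 for line in fh if line.strip())

    if records < min_records:
        problems.append(
            f"expected at least {min_records} records, found {records}"
        )
    return problems
```

An empty list means the output passed every check. In a workflow, a check like this could plausibly run as its own step after the job it verifies, so that a failed check stops the jobs that depend on that data.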

