It seems like you can’t pick up a technical magazine without reading about how big data is changing the world—and the untold implications of this technology. But what the heck is big data? And didn’t we already solve this thing with business intelligence and data warehousing?
Big data, or BD, is the collection of transaction-level detail for analysis. The data is kept close to the transactional detail so it can be examined for hidden trends only seen when you analyze the individual transactions. The data can come from different sources but is analyzed in a common pool. This is most often a feed (or copy) of the transactions as they occur; they are streamed to the BD solution. Often, the value of the data is very time-dependent; the sooner the information is available, the more valuable it is.
There are four key terms used when talking about BD:
- The volume of records in scope is large. Millions of records per day can and do occur.
- The velocity of records being created is fast, as BD is very granular, and collection of the data is close to real time.
- The veracity of the data, which is a fancy term for its quality, is at risk because inaccuracies can creep in when processing high volumes of data from multiple sources. Methods are needed to screen the data quickly so that accuracy keeps pace with the volume and velocity.
- There is a variety of data-generating devices, and as the number increases, it will become even more important to be able to interpret and consume the data from these different sources.
These four factors make using a conventional relational database management system impractical for storing and quickly analyzing BD, so new methods are being developed.
So, what is a business intelligence data warehouse?
A BIDW is a data analysis system that collects the transactional information and typically provides summaries on selected key fields of the transactions being watched. These summaries can be used to better understand the overall health and trends in the transactions being monitored. The BIDW data is a copy of production and is not in real time, so long-running queries can be initiated without concerns about impacting the live customer actions. Data may be loaded daily or weekly, depending on the data source. The data is kept at several levels to serve the different customers of the BIDW; summary data and dashboards are the most common outputs of a BIDW, but if needed, you can drill into the transactions.
At this point, you may reasonably not see a real difference between BD and BIDW, as both can contain transaction-level detail, but the two tools are typically used for very different purposes.
The following examples should make this difference clearer.
The Subway System
Imagine a subway system with ten stops that runs hourly. It is likely the BIDW would store information summaries by stop and hour for people entering and leaving the subway. This kind of data would let the planners adjust the subway schedule to handle the volume of passengers. It could also collect information on the type of fares (commuter, student, etc.) that are used when entering the system. This provides the ability to monitor the overall health and effectiveness of the subway system.
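As a rough illustration of the kind of summary a BIDW might hold, the sketch below rolls individual turnstile events up into counts by stop and hour, plus entries by fare type. The event layout, stop names, and the `summarize` function are assumptions made up for this example; a real warehouse would load such summaries through its own batch process.

```python
from collections import Counter

# Hypothetical turnstile events: (stop, hour_of_day, direction, fare_type)
events = [
    ("Stop 1", 8, "enter", "commuter"),
    ("Stop 1", 8, "enter", "student"),
    ("Stop 1", 8, "exit", "commuter"),
    ("Stop 2", 8, "enter", "commuter"),
    ("Stop 2", 9, "exit", "commuter"),
    ("Stop 2", 9, "exit", "student"),
]

def summarize(events):
    """Roll individual events up into the counts a warehouse would keep."""
    # People entering and leaving, summarized by stop and hour.
    by_stop_hour = Counter(
        (stop, hour, direction) for stop, hour, direction, _fare in events
    )
    # Fare types are only counted when a rider enters the system.
    entries_by_fare = Counter(
        fare for _stop, _hour, direction, fare in events if direction == "enter"
    )
    return by_stop_hour, entries_by_fare

by_stop_hour, entries_by_fare = summarize(events)
for (stop, hour, direction), count in sorted(by_stop_hour.items()):
    print(f"{stop} {hour:02d}:00 {direction}: {count}")
```

Note that once the events are rolled up this way, the link between where a given rider entered and where that rider left is gone, which is the gap the next paragraph addresses.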
If we want the perspective of the individual passenger (i.e., to see where each person enters and leaves the system), we need to drill deeper. Seeing detail at the transactional level is a job for BD, as the volume is high and the data arrives fast. Using BD, we can look at an individual user’s transactional detail (the journey) to learn more, such as how usage patterns change during the day.
If I own a restaurant that is just outside station X, I could look up what time the passengers with the longest rides (and therefore the hungriest) exit at my station so that I can plan when to make more food.
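The restaurant question could be sketched as follows, assuming each journey is available as one record with its entry and exit details. The record layout, the 40-minute cutoff for a "long" ride, and the function name are all hypothetical choices for this example.

```python
from collections import Counter

# Hypothetical journey records: (entry_stop, entry_minute, exit_stop, exit_minute)
# Minutes are counted from midnight.
journeys = [
    ("A", 7 * 60 + 5, "X", 7 * 60 + 50),     # 45-minute ride, exits at 07:50
    ("B", 7 * 60 + 30, "X", 7 * 60 + 55),    # 25-minute ride, exits at 07:55
    ("A", 17 * 60, "X", 17 * 60 + 45),       # 45-minute ride, exits at 17:45
    ("C", 17 * 60 + 10, "X", 17 * 60 + 50),  # 40-minute ride, exits at 17:50
    ("A", 17 * 60 + 20, "D", 18 * 60),       # exits at another station, ignored
]

def long_ride_exits_by_hour(journeys, station, min_ride_minutes):
    """Count riders leaving at `station`, per hour, whose ride was long."""
    counts = Counter()
    for _entry_stop, entry_min, exit_stop, exit_min in journeys:
        ride_length = exit_min - entry_min
        if exit_stop == station and ride_length >= min_ride_minutes:
            counts[exit_min // 60] += 1  # bucket by hour of exit
    return counts

busy_hours = long_ride_exits_by_hour(journeys, "X", 40)
for hour, riders in sorted(busy_hours.items()):
    print(f"{hour:02d}:00 - {riders} long-ride passengers exiting at X")
```

This only works because each record keeps the whole journey together; an hourly count of exits per stop could not tell a long ride from a short one.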
The Grocery Store
Imagine a store that sells only twenty items and is open 24/7. The BIDW likely knows the number of receipts and how much of each item is sold per day, and even when each item is sold.
This is all valuable information, but what if I want to see the relationships between two or more items? For example, do people buying beer tend to buy potato or corn chips, and does this change by brand of beer or chip? How does this relationship change during the day, week, or even month? Stores have known some of this information for a long time (pizza sells well on Fridays, and chips and beer do well during football season), but looking at individual receipts requires BD.
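To make the receipt-level idea concrete, here is a minimal sketch that counts how often each pair of items appears on the same receipt. The receipts and the `pair_counts` function are invented for illustration, and real basket analysis would also weigh how common each item is on its own.

```python
from collections import Counter
from itertools import combinations

# Hypothetical receipts: each one is the set of items bought together.
receipts = [
    {"beer", "potato chips"},
    {"beer", "corn chips", "salsa"},
    {"beer", "potato chips", "pizza"},
    {"milk", "bread"},
]

def pair_counts(receipts):
    """Count how many receipts contain each pair of items."""
    counts = Counter()
    for items in receipts:
        # Sort so that a pair is always stored in the same order.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

pairs = pair_counts(receipts)
print(pairs[("beer", "potato chips")])  # receipts with both beer and potato chips
print(pairs[("beer", "corn chips")])    # receipts with both beer and corn chips
```

Daily totals per item cannot answer this question, because they no longer record which items were bought together.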
This kind of insight leads to better-run stores with fewer out-of-stock items and less waste. Stores also can compare the items sold with the items on sale, and therefore learn more about their customers and develop more effective ways to communicate with them through different advertising. In fact, customer loyalty cards give stores the ability to know even more about their customers based on multiple visits, and to find ways to better serve them.
A twenty-item example is easy, but imagine this at the size of a typical supermarket chain, with thousands of items and multiple locations. Here is where BD can really add value: in converting data to information to knowledge and, finally, to wisdom.
Now, let’s take this data to a whole other level: the daily traffic flow in your city.
My kids don’t believe me, but the only way we used to know if there was a wreck or traffic backup was if the radio reporter in the helicopter reported it. Now, if you watch the news, you can see a real-time map showing the average speed on all the major roadways. This data comes from cell towers, which collect and report the movement and direction of each cell phone signal as it passes from tower to tower. As long as your phone is on, your data is part of the collection.
You can see in this example that volume is high and velocity is fast as each cell phone moves from tower to tower, but we also have the issues of veracity and variety to consider.
Your cell phone is also supplying this data when you are at home, out walking, or somewhere other than in your car driving. For veracity of traffic flows, these kinds of records need to be excluded as much as possible without filtering out the results of when traffic is stopped due to a traffic jam. You also need to factor in the case when multiple phones are in a single vehicle, such as in a car pool, city bus, or train. If you don’t have a cell phone or it is turned off, your car is not seen in the data, which also reduces the accuracy of the BD.
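A very crude version of this screening might look like the sketch below. Everything here is an assumption for illustration: the record layout, the `near_road` flag (which presumes some earlier step has already decided whether a phone is on or beside a roadway), and the rule that phones making the same tower handoff in the same minute are treated as one vehicle.

```python
# Hypothetical handoff records: (phone_id, from_tower, to_tower, minute, near_road)
records = [
    ("p1", "T1", "T2", 480, True),
    ("p2", "T1", "T2", 480, True),   # same handoff, same minute as p1
    ("p3", "T1", "T2", 481, True),   # same handoff, one minute later
    ("p4", "T5", "T5", 480, False),  # phone at home, not near a road
]

def estimate_vehicles(records):
    """Estimate vehicle count after two simple veracity screens."""
    vehicles = set()
    for _phone, src, dst, minute, near_road in records:
        # Screen 1: drop phones that are not on or beside a roadway.
        if not near_road:
            continue
        # Screen 2: phones sharing a handoff and a minute collapse into one
        # vehicle (a carpool or bus), because the set keeps one entry per key.
        vehicles.add((src, dst, minute))
    return len(vehicles)

print(estimate_vehicles(records))
```

The second screen shows why veracity is hard: it also merges separate cars that happen to cross between the same towers in the same minute, so the estimate trades one kind of error for another.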
As far as variety, because there are multiple cell phone companies and even more phone models, we need to be able to accept the data in differing formats flowing into a common tool, and it is unlikely we will be able to make the data source change to fit our needs.
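One common way to cope with this is to translate each source into a single shared record shape as the data arrives. The sketch below assumes two made-up carrier formats (a dictionary with epoch seconds, and a semicolon-separated line with an ISO 8601 timestamp); neither reflects any real carrier's feed.

```python
from datetime import datetime, timezone

def from_carrier_a(rec):
    """Hypothetical format A: a dict with the time in epoch seconds."""
    return {
        "phone": rec["device"],
        "tower": rec["tower"],
        "time": datetime.fromtimestamp(rec["ts"], tz=timezone.utc),
    }

def from_carrier_b(line):
    """Hypothetical format B: 'tower;phone;ISO 8601 timestamp with offset'."""
    tower, phone, stamp = line.split(";")
    return {
        "phone": phone,
        "tower": tower,
        "time": datetime.fromisoformat(stamp).astimezone(timezone.utc),
    }

# Both sources end up in the same shape, so they can share one pool.
a = from_carrier_a({"device": "p1", "tower": "T1", "ts": 1704096000})
b = from_carrier_b("T2;p2;2024-01-01T08:00:00+00:00")
common_pool = [a, b]
print(common_pool)
```

The design choice is that each new source costs one small adapter function, while everything downstream of the common pool stays unchanged.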
Processing this kind of data quickly into useful information requires specialized systems and tools, many with interesting names such as Splunk, Hadoop, Pig, and Hive, along with the family of NoSQL databases.
To collect traffic information for a BIDW, cities have traditionally placed hoses across the road, with a counter box counting each time a hose is driven over during the measurement period. This method is not thrown off by a car carrying more than one phone (or none), the numbers can be cheaply collected and analyzed over time, and it provides valuable information, such as daily or hourly traffic counts and average speed at an intersection.
The Future of Data
BD and BIDW are also the stuff of controversy. Many people are concerned that too much personal information is being collected by businesses and the government, and that this information could be misused. Yes, as systems gain the ability to hold ever-growing amounts of information, the risk increases that someone will use this information maliciously. But the more we know about how this type of data collection works, the better we can determine what the rules for its use need to be.
The rate of data growth is explosive, and it will only increase as more and more of our lives become digital (think real-time electric or water meters, smart refrigerators, etc.). This also creates tremendous career growth for people who want to be part of the data revolution.
BD can also easily lead to information overload. Generating massive reports that don’t answer the question asked, or that draw the wrong conclusions from flawed assumptions, isn’t helpful. Pretty graphs are nice, but you may lose credibility if all you do is provide information for information’s sake. BD workers need to understand not only how to collect and process the data available, but also what information can be gained from processing it, and how, in order to get to the wisdom part of the equation.
I see big data as where the Internet was in the ’90s: It’s growing very fast, but nobody knows what it will look like in ten years.