One can't help being impressed with the effort biologists, physicists, and other scientists devote to data quality. From careful design of experiments and data collection processes, to explicit definition of terms, to comprehensive efforts to ensure the data are correct, no effort is spared. This is not surprising. After all, data are the lifeblood of science.
Increasingly, data are also the lifeblood of business and government. And the attention science pays to data quality provides important lessons, especially for those interested in "Big Data."
Simply put, bad data make everything about Big Data — from discovering something truly novel, to building a product or service around that discovery, to monetizing the discovery — more difficult. The two most important problems are:
- The data are poorly defined, leading to incorrect interpretations.
- The data are simply wrong, incomplete, or out-of-date, leading to problems throughout.
Worse, in business bad data can be downright dangerous. Consider that throughout the mid-2000s, financial companies did a terrific job slicing, dicing, and packaging risk to create collateralized debt obligations (CDOs). But they either didn't know or didn't care that too much of the mortgage data used to create them were wrong. Eventually, of course, the bad data asserted themselves. And the financial system nearly collapsed.
Early computer programmers recognized that bad data were a problem and coined the expression, "garbage in, garbage out." The Big Data update is "big garbage in, big, TOXIC garbage out."
This example and observation underscore a critical point: whatever else you do, do not underestimate the data quality problem or the effort required to solve it. You must get in front of data quality. You can systematically improve data by following these recommendations, inspired by the best scientific traditions and the efforts of leading companies to translate those traditions into business practice. To start, think of data quality problems as falling into two categories, each requiring a different approach.
Address preexisting issues. Some problems have already been created, and you have no choice but to address them before you use the data in any serious way. This is time-consuming, expensive, and demanding work. You must make sure you understand the provenance of all data, what they truly mean, and how good they are. In parallel, you must clean the data. When I was at Bell Labs in the 1980s and '90s, we used the expression "rinse, wash, scrub" for increasingly sophisticated efforts to find and correct errors (or at least eliminate them from further analyses). For Big Data, a complete rinse, wash, and scrub may prove infeasible. An alternative is to complete the rinse, wash, scrub cycle for a small sample, repeat critical analyses using these "validated" data, and compare results. To be clear, this alternative must be used with extreme caution!
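To make the sample-validation alternative concrete, here is a minimal Python sketch, not the author's method: the column names, cleaning rules, and metric are hypothetical, and real "rinse, wash, scrub" passes would be far more sophisticated.

```python
import pandas as pd

def rinse(df: pd.DataFrame) -> pd.DataFrame:
    """Rinse: drop records with missing or impossible core values (hypothetical rule)."""
    return df.dropna(subset=["loan_amount"]).query("loan_amount > 0")

def wash(df: pd.DataFrame) -> pd.DataFrame:
    """Wash: remove duplicates and out-of-range rates (hypothetical rule)."""
    return df.drop_duplicates().query("interest_rate >= 0 and interest_rate <= 0.25")

def scrub(df: pd.DataFrame) -> pd.DataFrame:
    """Scrub: keep only records whose dates are internally consistent (hypothetical rule)."""
    return df[df["close_date"] >= df["application_date"]]

def compare_on_sample(df: pd.DataFrame, metric, sample_size: int = 1_000, seed: int = 0) -> dict:
    """Clean a small random sample, then compare a critical metric on raw vs. validated data."""
    sample = df.sample(n=min(sample_size, len(df)), random_state=seed)
    validated = scrub(wash(rinse(sample)))
    return {
        "metric_all_raw": metric(df),
        "metric_sample_validated": metric(validated),
        "rows_dropped_in_sample": len(sample) - len(validated),
    }

# Hypothetical usage: compare average loan amount before and after validation.
# results = compare_on_sample(loans_df, lambda d: d["loan_amount"].mean())
```

A large gap between the raw and validated results is exactly the warning sign the author describes, which is why this shortcut must be used with caution rather than as a substitute for full cleanup.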
But simply cleaning up erroneous data is not enough. New data are created and arrive far too quickly for cleanup to keep pace. Over the long term, the only way to deal with data quality problems is to prevent them.
Prevent the problems that haven't happened yet. Here is where scientific traditions of "getting close to the data" and "building quality in" are most instructive for Big Data practitioners. I've already mentioned the care scientists take to design their experiments, the efforts they make to define terms, and the lengths they go to understand end-to-end data collection. They also build controls (such as calibrating test equipment) into data collection; identify and eliminate the root causes of error; and upgrade equipment every chance they get. They keep error logs and subject their data to the scrutiny of their peers. This list can go on and on.
Those pursuing Big Data must adapt these traditions to their circumstances. Most really important data are used for many things (not just Big Data analyses), so you must specify the different needs of people who use them. Since the data originate from many sources, you must assign managers to cross-functional processes and to important external suppliers, and ensure that data creators understand what is expected. You must measure quality, build in controls that stop errors in their tracks, and apply Six Sigma and other methods to get at root causes. You must recognize that everyone touches data and can impact quality, so you must engage them in the effort.
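As one illustration of "controls that stop errors in their tracks," here is a minimal sketch, assuming hypothetical fields and rules rather than any particular system: each incoming record is validated at the point of creation, and every failure is logged so root causes can be investigated later.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class IntakeControl:
    """Validate records as they are created and keep an error log for root-cause review."""
    rules: dict[str, Callable[[dict], bool]]
    error_log: list[tuple[str, dict]] = field(default_factory=list)

    def accept(self, record: dict) -> bool:
        """Return True only if the record passes every rule; log each failed rule."""
        failures = [name for name, rule in self.rules.items() if not rule(record)]
        for name in failures:
            self.error_log.append((name, record))
        return not failures

# Hypothetical rules for a customer record.
control = IntakeControl(rules={
    "email_present": lambda r: bool(r.get("email")),
    "age_in_range": lambda r: 0 < r.get("age", -1) < 120,
})

control.accept({"email": "a@example.com", "age": 34})  # True: record enters the pipeline
control.accept({"email": "", "age": 200})              # False: rejected, two log entries
```

The point is not the code itself but the placement of the check: rejecting or flagging a bad record at its source is far cheaper than finding it downstream, and the error log gives the cross-functional process managers something concrete to act on.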
Interestingly, once you get the hang of it, none of the work to prevent errors is particularly difficult. But too many organizations don't muster the effort. There are dozens of reasons — excuses, really — from the belief that "if it is in the computer, it must be the responsibility of IT," to a lack of communication across silos, to blind acceptance of the status quo. While I don't want to minimize these issues, none stand up to scrutiny.
As I've opined elsewhere, it is time for senior leaders to get very edgy about data quality, get the managerial accountabilities right, and demand improvement. For bad data don't just bedevil Big Data. They foul up everything they touch, adding costs to operations, angering customers, and making it more difficult to make good decisions. The symptoms are sometimes acute, but the underlying problem is chronic. It demands an urgent and comprehensive response. Especially by those hoping to succeed with Big Data.