The current surge of enthusiasm around big data has produced a predictable backlash. Some of it, like Gary Marcus's New Yorker post "Steamrolled by Big Data," is insightful and well-reasoned (even though I have my quibbles with some of his points). This is not surprising, since he's a neuroscientist as well as a writer, and so quite comfortable with data.
Unfortunately, some other prominent commentators clearly aren't. David Brooks has taken up big data in his New York Times column recently, and literary lion Leon Wieseltier posted last month in The New Republic about "What Big Data Will Never Explain." Now, these guys are entitled to write about whatever they like, but if they want to be taken seriously when discussing data they really should stop making the kinds of elementary mistakes they've made so far. Their errors of understanding and fact weaken their credibility and turn off quantitatively adept readers.
So, as a public service, here's a short list, written for non-quant-jock pundits, of things to keep in mind when writing about data and its uses.
Absolute Certainty Is Not the Goal (Because It's Impossible). Wieseltier writes that "The purpose of this accumulated information is to detect patterns that will enable prediction: a world with uncertainty steadily decreasing to zero, as if that is a dream and not a nightmare." Everything in that sentence up to the colon is accurate; after it comes nonsense. When teaching introductory probability, I tell my students that a random variable (the mathematical workhorse of the data disciplines) is one where even after you know everything there is to know about it, you still don't know everything. For example, you know a fair coin toss will come up heads 50% of the time and tails 50%; that's it, and that's a long, long way from zero uncertainty.
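For readers who like to see the point in code rather than prose, here's a minimal sketch of the coin-toss argument. Nothing below is from Wieseltier or Brooks; it just illustrates that complete knowledge of a fair coin's distribution leaves exactly one bit of irreducible uncertainty per toss, and that piling up data doesn't shrink it.

```python
import math
import random

# A fair coin is a random variable we know everything about:
# P(heads) = P(tails) = 0.5. Yet each individual toss stays unpredictable.
p_heads = 0.5

# Shannon entropy in bits: the irreducible uncertainty of a single toss.
entropy = -(p_heads * math.log2(p_heads)
            + (1 - p_heads) * math.log2(1 - p_heads))
print(entropy)  # 1.0 -- maximal uncertainty for a two-outcome event

# Watching lots of past tosses only confirms what we already knew...
random.seed(42)
history = [random.choice("HT") for _ in range(10_000)]
print(history.count("H") / len(history))  # close to 0.5

# ...and tells us nothing extra about the next toss: still 50/50.
```

More data sharpens our estimate of the coin's bias, but for a genuinely random process the per-toss uncertainty never reaches zero, which is the point the quoted sentence misses.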
Data geeks desperately want to make better predictions using the seas of digital information available today. They want to know how many games the Red Sox will win this season, what course of treatment will zap that particular cancer, and whether they'll beat the dealer on the next hand. They know they'll never know any of these things for sure, and that zero uncertainty isn't even a meaningful goal to discuss.
People Are Not Inherently Better at Making Decisions, Predictions, Judgments, and Diagnoses. Brooks thinks that they are. He writes that "Data struggles with the social," "Data struggles with context," and "Data creates bigger haystacks" (apparently, when it comes to data, knowing more about a topic is bad), while on the other hand "The human brain has evolved to account for this reality. People are really good at telling stories that weave together multiple causes and multiple contexts."
And this is exactly the problem. The stories we tell ourselves are very often wrong, and we have a host of biases and other glitches in our mental wiring that keep us from sizing up a situation correctly.
How many of these glitches are there? I don't think anyone knows for sure. The best catalog I've come across so far is Rolf Dobelli's The Art of Thinking Clearly, which devotes a separate short chapter to each mental misfire he's identified. The book has 99 chapters.
The late Paul Meehl and William Grove analyzed 136 research studies directly comparing the predictions of humans, many of them 'experts,' against those coming exclusively from data and algorithms. Humans were clearly better in only 8 of the cases, giving them a batting average of .059. And Meehl and Grove hypothesized that even those 8 human victories might have been due to the fact that the people were "provided with more data than the actuarial formula."
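As a quick sanity check on that batting average, here's the arithmetic (nothing below goes beyond the 8-of-136 tally quoted above):

```python
# Meehl and Grove's tally: 136 head-to-head comparisons of human judgment
# against data-driven (actuarial) formulas; humans clearly won only 8.
human_wins = 8
total_studies = 136

batting_average = human_wins / total_studies
print(f"{batting_average:.4f}")  # 0.0588 -- roughly a .059 batting average
```

In baseball terms, that's a hitter who would have been cut from the team long ago.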
Quantification Is Useful in Every Field of Inquiry. Viktor Mayer-Schönberger and Kenneth Cukier say in their new book Big Data: A Revolution That Will Transform How We Live, Work, and Think that "Datafication represents an essential enrichment in human comprehension." Wieseltier reacts: "It is this inflated claim that gives offense... The religion of information is another superstition, another distorting totalism, another counterfeit deliverance." But I don't hear the two authors attempting to found a new religion around information; I hear them making the entirely reasonable claim that better, more precise measurement is a really valuable advance. The field of biology was transformed by Anton van Leeuwenhoek's microscope, which for the first time gave us the ability to see, count, and otherwise measure the tiny entities that exist at a different scale than we do. This led to a reduction in superstition, not an increase.
The great Victorian scientist Lord Kelvin laid down a general rule: "[W]hen you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of Science, whatever the matter may be."
Wieseltier might respond that some fields of inquiry aren't 'science' and should never be, but that response would be ridiculous. Science here is simply the process of testing claims against evidence. The ones resisting this are just about guaranteed to be the ones with the flimsiest claims.
Big Data's Advocates Don't Think Everything Can (and Should) Be Turned Over to Computers. Brooks says that "If you asked me to describe the rising philosophy of the day, I'd say it is data-ism...; that data will help us do remarkable things — like foretell the future." Wieseltier takes the same idea a lot further: "in the comprehensively quantified existence in which we presume to believe that eventually we will know everything, in the expanding universe of prediction in which hope and longing will come to seem obsolete and merely ignorant, we are renouncing some of the primary human experiences."
I've been talking to and hanging out with a lot of data geeks over the past few months, and even though they're highly ambitious people, I've never heard any of them express anything like those sentiments and goals. In fact, they're very circumspect when they talk about their work. They know that the universe is a ridiculously messy and complex place and that all we can do is chip away at its mysteries with whatever tools are available, our brains always first and foremost among them.
The geeks are excited these days because in the current era of Big Data the tools just got a whole lot better. If someone told them their goal was to make hope and longing "obsolete and merely ignorant," they'd probably find a way to turn that statement into a brilliantly nasty visual meme, post it on Reddit, and get back to work.