Conferences like Strata are planned a year in advance. The logistics and coordination required for an event of this magnitude takes a lot of planning, but it also takes a decent amount of prediction: Strata needs to skate to where the puck is going.
While Strata New York + Hadoop World 2013 is still a few months away, we’re already guessing at what next year’s Santa Clara event will hold. Recently, the team got together to identify some of the hot topics in big data, ubiquitous computing, and new interfaces. We selected eleven big topics for deeper investigation.
- Deep learning
- Time-series data
- The big data “app stack”
- Cultural barriers to change
- Design patterns
- Laggards and Luddites
- The convergence of two databases
- The other stacks
- Mobile data
- The analytic life-cycle
- Data anthropology
Here’s a bit more detail on each of them.
Teaching machines to think has been a dream/nightmare of scientists for a long time. Rather than teaching a machine explicitly, or using Watson-like statistics to figure out the best answer from a mass of data, Deep Learning uses simpler, core ideas and then builds upon them — much as a baby learns sounds, then words, then sentences.
It’s been applied to problems like vision (find an edge, then a shape, then an object) and better voice recognition. But advances in processing and algorithms are making it increasingly attractive for a large number of challenges. A Deep Learning model “copes” better with things its creators can’t foresee, or genuinely new situations. A recent MIT Technology Review article said these approaches improved image recognition by 70%, and improved Android voice recognition 25%. But 80% of the benefits come from additional computing power, not algorithms, so this is stuff that’s only become possible with the advent of cheap, on-demand, highly parallel processing.
The main drivers of this approach are big companies like Google (which acquired DNNResearch), IBM and Microsoft. There are also startups in the machine learning space like Vicarious and Grok (née Numenta).
Deep Learning isn’t without its critics. Something learned in a moment of pain or danger might not be true later on, so the system needs to unlearn — or at least reduce the certainty — of a conclusion. What’s more, certain things might only be true after a sequence of events: once we’ve seen a person put a ball in a box and close the lid, we know there is a ball in the box, but a picture of the box afterward wouldn’t reveal this. Inability to take into account time is one of the criticisms Grok founder Jeff Hawkins levels at Deep Learning.
There’s some good debate, and real progress in AI and machine learning, as a result of the new computing systems that make these models possible. They’ll likely supplant the expert systems (yes/no trees) that are used in many industries, but have fundamental flaws. Ben Goldacre described this problem at Strata in 2012: almost every patient who displays the symptoms of a rare disease instead has two, much more common, diseases with those symptoms.
Also this is why House is a terrible doctor’s show.
In 2014, much of the data science content of Strata will focus on making machines smarter, and much of this will come from abundant back-end processing paired with advances in vision, sensemaking, and context.
Data is often structured according to the way it will be used.
- To data designers, a graph is a mathematical structure that describes how a pair of objects relate to one another. This is why Facebook’s search tool is called Graph Search. To work with large numbers of relationships, we use a Graph database that organizes everything in it according to how it’s related to everything else. This makes it easy to find things that are linked to one another, like routers in a network or friends at a company, even with millions of connections. As a result, it’s often in the core of a social network’s application stack. Companies like Neo4j and Titan and Vertex make them.
- On the other hand, a relational database keeps several tables of data (your name; a product purchase) and then links them by a common thread (such as the credit card used to buy the product, or the name of the person to whom it belongs). When most traditional enterprise IT people say “database,” they mean a relational database (RDBMS). The RDBMS has been so successful it’s supplanted most other forms of data storage.
(As a sidenote, at the core of the RDBMS is a “join,” an operation that links two tables. Much of the excitement around NoSQL databases was in fact about doing away with the join, which — though powerful — significantly restricts how quickly and efficiently an RDBMS can process large amounts of data. Ironically, the dominant language for querying many of these NoSQL databases through tools like Impala is now SQL. If the NoSQL movement had instead been called NoJoin, things might have been much more clear.)Data systems are often optimized for a specific use.
- Think of a coin-sorting machine — it’s really good at organizing many coins of a limited variety (nickels, dimes, pennies, etc.).
- Now think of a library — it’s really good at a huge diversity of books, often only one or two of each, and not very fast.
Databases are the same: a graph database is built differently from a relational database; an analytical database (used to explore and report on data) is different from an operational one (used in production).
Most of the data in your life — from your Facebook feed to your bank statement — has one common element: time. Time is the primary key of the universe.
Since time is often the common thread in data, optimizing databases and processing systems to be really, really good at handling data over time is a huge benefit for many applications, particularly those that try to find correlations between seemingly different data — does the temperature on your NEST thermostat correlate with an increase in asthma inhaler use? Black Swans aside, time is also useful when trying to predict the future from the past.
Time Series data is at the root of life-logging and the Quantified Self movement, and will be critical for the Internet of Things. It’s a natural way to organize things which, as humans, we fundamentally understand. Time series databases have a long history, and there’s a lot of effort underway to modernize them as well as the analytical tools that crunch the data they contain, so we think time-series data deserves deeper study in 2014.
The Big Data app stack
We think we’re about to see the rise of application suites for big data. Consider the following evolution:
- On a mainframe, the hardware, operating system, and applications were often indistinguishable.
- Much of the growth of consumer PCs happened because of the separation of these pieces — companies like Intel and Phoenix made the hardware; Microsoft and Unix made the OS; and developers like WordPerfect, Lotus, and DBase made the applications.
- Eventually, we figured out what the PC was “for” and it acquired a core set of applications without which, it seems, a PC wouldn’t be useful. Those are generally described as “office suites,” and while there was once a rivalry for them, today, they’ve been subsumed by OS makers (Apple, Microsoft, Open Source) while those that didn’t have an OS withered on the vine (Corel).
- As we moved onto the web, the same thing started to happen — email, social network, blog, and calendar seemed to be useful online applications now that we were all connected, and the big portal makers like Google, Sina, Yahoo, Naver, and Facebook made “suites” of these things. So, too, did the smartphone platforms, from PalmPilot to Blackberry to Apple and Android.
- Today’s private cloud platforms are like yesterday’s operating systems, with OpenStack, CloudPlatform, VMWare, Eucalyptus, and a few others competing based on their compatibility with public clouds, hardware, and applications. Clouds are just going through this transition to apps, and we’re learning that their “app suite” includes things like virtual desktops, disaster recovery, on-demand storage — and of course, big data.
Okay, enough history lesson.
We’re seeing similar patterns emerge in big data. But it’s harder to see what the application suite is before it happens. In 2014, we think we’ll be asking ourselves, what’s the Microsoft Office of Big Data? We can make some guesses:
- Predicting the future
- Deciding what people or things are related to other people or things
- Helping to power augmented reality tools like Google Glass with smart context
- Making recommendations by guessing what products will appeal to which customers
- Optimizing bottlenecks in supply chains or processes
- Identifying health risks or anomalies worthy of investigation
Companies like Wibidata are trying to figure this out — and getting backed by investors with deep pockets. Just as most of the interesting stories about operating systems were the apps that ran on them, and the stories about clouds are things like big data, so the good stories about big data are the “office suites” atop it. Put another way, we don’t know yet what big data is for, but I suspect that in 2014 we’ll start to find out.
Cultural barriers to data-driven change
Every time I talk with companies about data, they love the concept but fail on the execution. There are a number of reasons for this:
- Incumbency. Yesterday’s leaders were those who could convince others to act in the absence of information. Tomorrow’s leaders are those who can ask the right questions. This means there is a lot of resistance from yesterday’s leaders (think Moneyball).
- Lack of empowerment. I recently ate a meal in the Pittsburgh airport, and my bill came with a purple pen. I’m now wondering if I tipped differently because of that. What ink colour maximizes per-cover revenues in an airport restaurant? (Admittedly, I’m a bit obsessive.) But there’s no reason someone couldn’t run that experiment, and increase revenues. Are they empowered to do so? How would they capture the data? What would they deem a success? These are cultural and organizational questions that need to be tackled by the company if it is to become data-driven.
- Risk aversion. Steve Blank says a startup is an organization designed to search for a scalable, repeatable business model. Here’s a corollary: a big company is one designed to perpetuate a scalable, repeatable business model. Change is not in its DNA — predictability is. Since the days of Daniel McCallum, organizational charts and processes fundamentally reinforce the current way of doing things. It often takes a crisis (such as German jet planes in World War Two or Netscape’s attack on Microsoft) to evoke a response (the Lockheed Martin Skunk Works or a free web browser).
- Improper understanding. Correlation is not causality — there is a correlation between ice cream and drowning, but that doesn’t mean we should ban ice cream. Both are caused by summertime. We should hire more lifeguards (and stock up on ice cream!) in the summer. Yet many people don’t distinguish between correlation and causality. As a species, humans are wired to find patterns everywhere because a false positive (turning when we hear a rustle in the bushes, only to find there’s nothing there) is less dangerous than a false negative (not turning and getting eaten by a sabre-toothed tiger).
- Focus on the wrong data. Lean Analytics urges founders to be more data-driven and less self-delusional. But when I recently spoke with executives from DHL’s innovation group, they said that innovation in a big company requires a wilful disregard for data. That’s because the preponderance of data in a big company reinforces the status quo; nascent, disruptive ideas don’t stand a chance. Big organizations have all the evidence they need to keep doing what they have always done.
There are plenty of other reasons why big organizations have a hard time embracing data. Companies like IBM, CGI, and Accenture are minting money trying to help incumbent organizations be the next Netflix and not the next Blockbuster.
What’s more, the advent of clouds, social media, and tools like PayPal or the App Store has destroyed many of the barriers to entry on which big companies rely. As Quentin Hardy pointed out in a recent article, fewer and fewer big firms stick around for the long haul.
As any conference matures, we move into best practices. The way these manifest themselves with architecture is in the form of proven architectures — snippets of recipes people can re-use. Just as a baker knows how to make an icing sauce with fat and sugar — and can adjust it to make myriad variations — so, too, can an architect use a particular architecture to build a known, working component or service.
As Mike Loukides points out, a design pattern is even more abstract than a recipe. It’s like saying, “sweet bread with topping,” which can then be instantiated in any number of different kinds of cake recipes. So, we have a design pattern for “highly available storage” and then rely on proven architectural recipes such as load-balancing, geographic redundancy, and eventual consistency to achieve it.
Such recipes are well understood in computing, and they eventually become standards and appliances. We have a “scale-out” architecture for web computing, where many cheap computers can handle a task, as an Application Delivery Controller (a load balancer) “sprays” traffic across those machines. It’s common wisdom today. But once, it was innovative. Same thing with password recovery mechanisms and hundreds of other building blocks.
We’ll see these building blocks emerge for data systems that meet specific needs. For example, a new technology called homomorphic encryption allows us to analyze data while it is still encrypted, without actually seeing the data. That would, for example, allow us to measure the spread of a disease without violating the privacy of the individual patients. (We had a presenter talk about this at DDBD in Santa Clara.) This will eventually become a vital ingredient in a recipe for “data where privacy is maintained.” There will be other recipes optimized for speed, or resiliency, or cost, all in service of the “highly available storage” pattern.
This is how we move beyond vendors. Just as a scale-out web infrastructure can have an ADC from Radware, Citrix, F5, Riverbed, Cisco, and others (with the same pattern), we’ll see design patterns for big data with components that could come from Cloudera, Hortonworks, IBM, Intel, MapR, Oracle, Microsoft, Google, Amazon, Rackspace, Teradata, and hundreds of others.
Note that many vendors who want to sell “software suites” will hate this. Just as stereo vendors tried to sell all-in-one audio systems, which ultimately weren’t very good, many of today’s commercial providers want to sell turnkey systems that don’t allow the replacement of components. Design patterns and the architectures on which they rely are anathema to these closed systems — and are often where standards tracks emerge. 2014 is when that debate will start out in Big Data.
Laggards and Luddites
Certain industries are inherently risk-averse, or not technological. But that changes fast. A few years ago, I was helping a company called FarmsReach connect restaurants to local farmers and turn the public market into a supply chain hub. We spent a ton of effort building a fax gateway because farmers didn’t have mobile phones, and ultimately, the company pivoted to focus on building networks between farmers.
As the cost of big data drops and the ease of use increases, we’ll see it applied in many other places. Consider, for example, a city that can’t handle waste disposal. Traditionally, the city would buy more garbage trucks and hire more garbage collectors. But now, it can analyze routing and find places to optimize collection. Unfortunately, this requires increased tracking of workers — something the unions will resist very vocally. We already saw this in education, where efforts to track students were shut down by teachers’ unions.
In 2014, big data will be crossing the chasm, welcoming late adopters and critics to the conversation. It’ll mean broadening the scope of the discussion — and addressing newfound skepticism — at Strata.
Convergence of two databases
If you’re running a data-driven product today, you typically have two parallel systems.
- One’s in production. If you’re an online retailer, this is where the shopping cart and its contents live, or where the user’s shipping address is stored.
- The other’s used for analysis. An online retailer might make queries to find out what someone bought in order to handle a customer complaint or generate a report to see which products are selling best.
Analytical technology comes from companies like Teradata, IBM (from the Cognos acquisition), Oracle (from the Hyperion acquisition), SAP, and independent Microstrategy, among many others. They use words like “Data Warehouse” to describe these products, and they’ve been making them for decades. Data analysts work with them, running queries and sending reports to corporate bosses. A standalone analytical data warehouse is commonly accepted wisdom in enterprise IT.
But those data warehouses are getting faster and faster. Rather than running a report and getting it a day later, analysts can explore the data in real time — re-sorting it by some dimension, filtering it in some way, and drilling down. This is often called pivoting, and if you’ve ever used a Pivot Table in Excel you know what it’s like. In data warehouses, however, we’re dealing with millions of rows.
At the same time, operational databases are getting faster and sneakier. Traditionally, a database is the bottleneck in an application because it doesn’t handle concurrency well. If a record is being changed in the database by one person, it’s locked so nobody else can touch it. If I am editing a Word document, it makes sense to lock it so someone else doesn’t edit it — after all, what would we do with the changes we’d both made?
But that model wouldn’t work for Facebook or Twitter. Imagine a world where, when you’re updating your status, all your friends can’t refresh their feeds.
We’ve found ways to fix this. When several people edit a Google Doc at once, for instance, each of their changes is made as a series of small transactions. The document doesn’t really exist — instead, it’s a series of transactional updates, assembled to look like a document. Similarly, when you post something to Facebook, those changes eventually find their way to your friends. The same is true on Twitter or Google+.
These kinds of eventually consistent approaches make concurrent editing possible. They aren’t really new, either: your bank statement is eventually consistent, and when you check it online, the bottom of the statement tells you that the balance is only valid up until a period in the past and new transactions may take a while to post. Here’s what mine says:
Transactions from today are reflected in your balance, but may not be displayed on this page if you recently updated your bankbook, if a paper statement was issued, or if a transaction is backdated. These transactions will appear in your history the following business day.
Clearly, if eventual consistency is good enough for my bank account, it’s good enough for some forms of enterprise data.
So, we have analytical databases getting real-time fast and operational databases increasingly able to do things concurrently without affecting production systems. Which begs the question: why do we have two databases?
This is a massive, controversial issue worth billions of dollars. Take, for example, EMC, which recently merged its Greenplum acquisition into Pivotal. Pivotal’s marketing (“help customers build, deploy, scale, and analyze at an unprecedented velocity”) points at this convergence, which may happen as organizations move their applications into cloud environments (which is partly why Pivotal includes Cloud Foundry, which VMWare acquired).
The change will probably create some huge industry consolidation in the coming years (think Oracle buying Teradata, then selling a unified operational/analytical database). There are plenty of reasons it’s a bad idea, and plenty of reasons why it’s a good one. We think this will be a hot topic in 2014.
Cassandra and the other stacks
Big data has been synonymous with Hadoop. The break-out success of the Hadoop ecosystem has been astonishing, but it does other stacks a disservice. There are plenty of other robust data architectures that have furiously enthusiastic tribes behind them. Cassandra, for example, was created by Facebook, released into the wild, and tamed by Reddit to allow the site to scale to millions of daily visitors atop Amazon with only a handful of employees. MongoDB is another great example, and there are dozens more.
Some of these stacks got wrapped around the axle of the NoSQL debate, which, as I mentioned, might have been better framed as NoJoin. But we’re past that now, and there are strong case studies for many of the stacks. There are also proven affinities between a particular stack (such as Cassandra) and a particular cloud (such as Amazon Web Services) because of their various heritages.
In 2014, we’ll be discussing more abstract topics and regarding every one of these stacks as a tool in a good toolbox.
By next year, there will be more mobile phones in the world than there are humans, over one billion of them “smart.” They are the closest thing we have to a tag for people. Whether measuring mall traffic for shoppers or projecting the source of Malarial outbreaks in Africa, it’s big. One carrier recently released mobile data from the Ivory Coast to researchers.
Just as Time Series data has structure, so does geographic data, much of which lives in Strata’s Connected World track. Mobile data is a precursor to the Internet of Everything, and it’s certainly one of the most prolific structured data sources in the world.
I think concentrating on mobility is critical for another reason, too. The large systems created to handle traffic for the nearly 1,000 carriers in the world are big, fast, and rock solid. An AT&T 5ESS switch or one of the large-scale Operational Support Systems, simply does not fall over.
Other than DNS, the Internet doesn’t really have this kind of industrial-grade system for managing billions of devices, each of which can connect to the others with just a single address. That is astonishing scale, and we tend to ignore it as “plumbing.” In 2014 , the control systems for the Internet of Everything are as likely to come from Big Iron made by Ericsson as they are to come from some Web 2.0 titan.
The analytic life-cycle
The book The Theory That Would Not Die begins with a quote from John Maynard Keynes: “When the facts change, I change my opinion. What do you do, sir?” As this New York Times review of the book observes, “If you are not thinking like a Bayesian, perhaps you should be.”
Bayes’ theorem says that beliefs must be updated based on new evidence — and in an information-saturated world, new evidence arrives constantly, which means the cycle turns quickly. To many readers, this is nothing more than explaining the scientific method. But there are plenty of people who weren’t weaned on experimentation and continuous learning — and even those with a background in science make dumb mistakes, as the Boy Or Girl Paradox handily demonstrates.
Ben Lorica, O’Reilly’s chief scientist (and owner of the enviable Twitter handle @bigdata) recently wrote about the lifecycle of data analysis. I wrote another piece on the Lean Analytics cycle with Avinash Kaushik a few months ago. In both cases, it’s an iterative process of hypothesis-forming, experimentation, data collection, and readjustment.
In 2014, we’ll be spending more time looking at the whole cycle of data analysis, including collection, storage, interpretation, and the practice of asking good questions informed by new evidence.
Data seldom tells the whole story. After flooding in Haiti, mobile phone data suggested people weren’t leaving one affected area for a safe haven. Researchers concluded that they were all sick with cholera and couldn’t move. But by interviewing people on the ground, aid workers found out the real problem was that flooding had destroyed the roads, making it hard to leave.
As this example shows, there’s no substitute for context. In Lean Analytics, we say “Instincts are experiments. Data is proof.” For some reason this resonated hugely and is one of the most favorited/highlighted passages in the book. People want a blend of human and machine, of soft, squishy qualitative data alongside cold, hard quantitative data. We joke that in the early stages of a startup, your only metric should be “how many people have I spoken with?” It’s too early to start counting.
In Ash Maurya’s Running Lean, there’s a lot said about customer development. Learning how to conduct good interviews that don’t lead the witness and measuring the cultural factors that can pollute data is hugely difficult. In The Righteous Mind, Jonathan Haidt says all university research is WEIRD: Western, Educated, Industrialized, Rich, and Democratic. That’s because test subjects are most often undergraduates, who fit this bill. To prove his assertion, Haidt replicated studies done on campus at a McDonald’s a few miles away, with vastly different results.
At the first Strata New York, I actually left the main room one morning to go write a blog post. I was so overcome by the examples of data errors — from bad collection, to bad analysis, to wilfully ignoring the results of good data — that it seemed to me we weren’t paying attention to the right things. If “Data is the new Oil,” then its supply chain is a controversial XL pipeline with woefully few people looking for leaks and faults. Anthropology can fix this, tying quantitative assumptions to verification.
So, data anthropology can ensure good data collection, provide essential context to data, and check that the resulting knowledge is producing the intended results. That’s why it’s on our list of hot topics for 2014.