Over the last few years I've created a few popular visualizations and a lot of duds, and I've learned a few lessons along the way. For my latest analysis of where Facebook users go on vacation, I decided to document the steps I follow to build my visualizations. It's a very rough guide: these are just stages I've learned to follow by trial and error. Still, following these guidelines is a good way to start if you're looking to create your first visualization.
Play with your data
I was lucky enough to spend a few hours recently with Andreas Weigend, head of the Stanford Social Data Lab. He has nine rules of data, and the first is "Start with the problem, not the data." What struck me about visualizations is that I actually take the opposite approach. I find the only way to begin is to explore what information is available and get a feeling for what stories it can tell.
In my case, we have a Cassandra cluster with information on more than 350 million photos shared on Facebook. I've been running Pig analytics jobs regularly to get a view of what we have in there. One of the reports we generate is a count of how many photos and users we have for particular places:
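The real report comes out of a Pig job, but the aggregation behind it is simple enough to sketch in a few lines of Python. Treat this as an illustration only: the record layout (one `(place, user_id)` row per shared photo) and the sample rows are assumptions I've made up for the example, not our actual schema or data.

```python
from collections import defaultdict

# Hypothetical sample records: one (place, user_id) row per shared photo.
photos = [
    ("Paris", "u1"),
    ("Paris", "u1"),
    ("Paris", "u2"),
    ("Las Vegas", "u3"),
    ("Las Vegas", "u1"),
]

def place_report(rows):
    """Return {place: (photo_count, distinct_user_count)}."""
    photo_counts = defaultdict(int)
    users = defaultdict(set)
    for place, user_id in rows:
        photo_counts[place] += 1
        users[place].add(user_id)
    return {place: (photo_counts[place], len(users[place])) for place in photo_counts}

report = place_report(photos)

# Print the places with the most photos first.
for place, (n_photos, n_users) in sorted(report.items(), key=lambda kv: -kv[1][0]):
    print(f"{place}: {n_photos} photos from {n_users} users")
```

Counting distinct users separately from photos matters, because one enthusiastic photographer can otherwise make a place look far more popular than it is.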
I was chatting with my colleague Chris Raynor about this, and he asked me if we could tell where all the visitors to those places were coming from. This was something that had been at the back of my mind for a long time. Seeing how much information we had on each destination made me realize we had enough data to produce significant and meaningful answers.
When I was learning engineering, one of my favorite case studies was an investigation into an air-traffic control system. Software engineers couldn't understand why fully-computerized control rooms were actually less efficient and less safe than more old-fashioned sites. What the researchers discovered was that the old process of passing around and arranging small cards that each represented a plane gave controllers a much stronger awareness of the situation than a screen that didn't require their involvement for tasks such as handing an aircraft to a colleague. I think the same is true of data. The more time you spend manipulating and examining the raw information, the more you understand it at a deep level. Knowing your data is the essential starting point for any visualization.
Pick a question
Now that I had a rough idea of what I wanted to visualize, I really needed to focus on what I would be doing. The best way to do that is to choose the exact title you want to give your visualization. I actually messed this up on one early map I created, giving the blog post the title "How to split up the US." Everyone subsequently described it as "The Five Nations of Facebook." Since then, I've tried very hard to pick the most natural title for what I'm going to be presenting, and then ensure I can deliver on the promise of the headline.
In this case I had a clear idea of the question at the start: it was going to be "Where do people go on vacation?" However, as I thought about it, I realized it needed to be a lot more specific and concrete. There are already a lot of "top travel destinations" lists out there, so what made mine different? It was the use of Facebook to gather much richer and more detailed information, so I refined it to "Where do Facebook users go on vacation?"
Sketch out your presentation
I now had the data and a question I wanted to answer. The next step was figuring out how to show the information in a visual form. I'm in love with network diagrams showing connections between thousands of objects, but so often they are completely baffling to the rest of the world. I still remember David Cohen threatening to strangle me if I showed him another one of "those damn spider webs" instead of a business plan. However, network diagrams are a good way of hinting at how much data is available for querying; they can really give an idea of the sheer scale of what's there.
One of my favorite recent visualizations was Paul Butler's map of friendships on Facebook, so I decided to use that as a visual reference:
I borrowed a couple of key ideas from his work: the general color palette of the blue lines on a dark background and the use of great circles to create flowing arcs for all connections.
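For anyone who hasn't drawn great circle arcs before, here is a minimal Python sketch of the standard spherical interpolation formula. It isn't the code behind either map; the function name, the `steps` parameter, and the rough San Francisco and Paris coordinates are my own choices for illustration.

```python
import math

def great_circle_points(lat1, lon1, lat2, lon2, steps=32):
    """Return steps + 1 (lat, lon) points, in degrees, along the great
    circle from point 1 to point 2. Antipodal endpoints are not handled,
    because the arc between them is not unique."""
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))

    # Angular distance between the endpoints (haversine formula).
    h = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin((l2 - l1) / 2) ** 2)
    d = 2 * math.asin(min(1.0, math.sqrt(h)))
    if d == 0:
        return [(lat1, lon1)] * (steps + 1)

    points = []
    for i in range(steps + 1):
        f = i / steps  # fraction of the way along the arc
        a = math.sin((1 - f) * d) / math.sin(d)
        b = math.sin(f * d) / math.sin(d)
        # Blend the two endpoints as 3D vectors on the unit sphere.
        x = a * math.cos(p1) * math.cos(l1) + b * math.cos(p2) * math.cos(l2)
        y = a * math.cos(p1) * math.sin(l1) + b * math.cos(p2) * math.sin(l2)
        z = a * math.sin(p1) + b * math.sin(p2)
        points.append((math.degrees(math.atan2(z, math.hypot(x, y))),
                       math.degrees(math.atan2(y, x))))
    return points

# Roughly San Francisco to Paris, sampled at five points.
arc = great_circle_points(37.77, -122.42, 48.86, 2.35, steps=4)
for lat, lon in arc:
    print(f"{lat:7.2f} {lon:8.2f}")
```

Projecting each sampled point onto the map and joining them with line segments gives the flowing curve. The middle of this particular arc swings well north of both endpoints, which is exactly what makes the lines look so different from straight strokes on a flat map.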
As I thought about the presentation, I realized that I had to simplify what it would be showing. With sources and destinations plotted all over the world, both the visual look and the querying interface would be overwhelming. Our user base is primarily American thanks to our reliance on English-only natural language processing, so with that in mind I decided to make life simpler by only showing data from people who lived in the U.S. Accordingly, I changed the question in my title to "Where do American Facebook users go on vacation?"
While I'm mostly presenting this as a linear, waterfall process, what I've just described is a good example of how iterative cycles drive the real workflow. It's hard to know how well a lot of things will work until you try them. As long as you're still making progress, don't worry if you find yourself going in circles.
Crunch the data
If you know your data, and you have a good idea of the question you're trying to answer, this should be the simplest stage. You'll hopefully have a clear set of requirements and it's just a matter of executing the right queries over your data.
In this case I already had some Pig scripts asking similar questions, so I was able to adapt one of those. The biggest surprise was running into issues with some of the joins. Fortunately the hard part, running the Hadoop job to gather the raw data from our Cassandra cluster, worked, so I was able to output smaller files containing the gathered data and then run a local Pig job to do the joins I needed.
The next stage was turning the raw information into a form that could be displayed. For example, I needed to take all of the user locations from the unstructured text strings that Facebook gave me, and convert them into latitude-longitude coordinates for plotting on a map. For this sort of work I usually turn to a general-purpose scripting language, and most of Jetpac is already written in Ruby, so that was an easy choice. I wrote a script that walked through the data, using the Data Science Toolkit to match coordinates with names, and then output it into a file containing a JSON array of all the information.
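To give a feel for that step, here is a rough Python sketch of the same shape of script (the original was Ruby). The `GAZETTEER` table is a hypothetical stand-in for the Data Science Toolkit lookup, and the field names and sample rows are assumptions for illustration rather than the real file format.

```python
import json

# Stand-in gazetteer. The real script asked the Data Science Toolkit for
# coordinates; this tiny hard-coded table is a hypothetical replacement.
GAZETTEER = {
    "san francisco, california": (37.77, -122.42),
    "boulder, colorado": (40.01, -105.27),
}

def geocode(text):
    """Return (lat, lon) for a free-text location, or None if unknown."""
    # Lowercase and collapse runs of whitespace before looking the name up.
    return GAZETTEER.get(" ".join(text.lower().split()))

def build_records(rows):
    """rows: iterable of (home_location_text, destination, visitors)."""
    records = []
    for home, destination, visitors in rows:
        coords = geocode(home)
        if coords is None:
            continue  # skip locations that can't be placed on the map
        lat, lon = coords
        records.append({
            "home": home.strip(),
            "lat": lat,
            "lon": lon,
            "destination": destination,
            "visitors": visitors,
        })
    return records

# Hypothetical sample rows with the kind of messy text users type in.
rows = [
    ("San Francisco,  California", "Paris", 12),
    ("  Boulder, Colorado", "Las Vegas", 7),
    ("somewhere over the rainbow", "Oz", 1),
]

records = build_records(rows)
output = json.dumps(records, indent=2)
print(output)
```

Dropping locations that fail to resolve is the simplest policy. Whatever you choose, it's worth counting how many rows you lose so you know how much of the data actually makes it onto the map.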
Build an interface
A lot of the best visualizations have no interactivity. They just tell a story with a static image. That's why it's worth considering whether you need an interface at all. I actually had the interactive site that I used to create the "Five Nations of Facebook" visualization up for several weeks before that post, and nobody used it because it was too confusing. It was only when I boiled it down into a single picture with labels that it became a hit.
My problem is that I want other people to have as much fun exploring the data as I've had, so I couldn't resist adding some interaction to the vacation visualization. I still wanted to retain the immediate visual appeal of a static image, so I decided to create a background showing the full data to introduce the visualization at a first glance, and then overlay an interactive foreground once the user started exploring it more deeply.
I then tied in rendering the connections of any places that the user was hovering their cursor over, so that they could quickly get a feel for the relationships expressed in the data. I also wanted to display the details underlying the picture, so to drill down I added a dialog listing the raw statistics about a place. Users can bring this dialog up by clicking.
One problem with that interaction is that a lot of different cities are in a very small area, so it becomes extremely difficult to pick the one you want with the mouse cursor. To make that a little better, I prioritized the most popular U.S. cities so that in case of a conflict, they're chosen over their smaller neighbors. I realized I also needed to add a search box. Thankfully we're heavy users of Twitter's Bootstrap framework, so it was a simple matter to add a search field and tie it in with Twitter's excellent autocomplete component.
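Both fixes are easy to sketch. The Python below is an illustration of the logic, not the site's actual code: the pixel coordinates, visitor counts, pick radius, and function names are all made up, and `search` is only a stand-in for the kind of matching an autocomplete component does.

```python
import math

# Hypothetical sample cities: (name, x, y, visitor_count), with x and y
# in screen pixels.
CITIES = [
    ("Los Angeles", 100.0, 200.0, 5000),
    ("West Hollywood", 101.0, 199.0, 800),
    ("San Diego", 130.0, 240.0, 2000),
]

def pick_city(cities, mouse_x, mouse_y, radius=8.0):
    """Return the most popular city within `radius` pixels of the cursor,
    or None if no city is close enough."""
    nearby = [c for c in cities
              if math.hypot(c[1] - mouse_x, c[2] - mouse_y) <= radius]
    if not nearby:
        return None
    # On a conflict, popularity wins over raw distance.
    return max(nearby, key=lambda c: c[3])[0]

def search(cities, prefix, limit=5):
    """Case-insensitive prefix match, most popular cities first."""
    p = prefix.strip().lower()
    if not p:
        return []
    matches = [c for c in cities if c[0].lower().startswith(p)]
    matches.sort(key=lambda c: -c[3])
    return [c[0] for c in matches[:limit]]

picked = pick_city(CITIES, 101.0, 199.0)  # cursor directly over West Hollywood
missed = pick_city(CITIES, 300.0, 300.0)  # cursor over empty map
found = search(CITIES, "west")
print(picked, missed, found)
```

Notice that hovering directly over the smaller city still selects its bigger neighbor. That trade-off is the reason the search box is needed: it's the only reliable way to reach places that lose every conflict.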
Find the surprises!
I build these visualizations so I can explore them myself, so my favorite part of the whole process is the chance to sit and play with the results. There are always unexpected stories hidden in there, and I love uncovering them. For example, who knew that the city that sent the most visitors to Paris was West Hollywood? When I lived in Los Angeles I used to love popping by the wonderful patisseries. Now I know why they're so good! These little details are the stories that catch people's imagination and cause them to spread the word, so think about writing a few of them up to help visitors understand what the page can tell them.
You'll never know ahead of time whether one of your visualizations will become popular, but the real reward is enjoying your own work. I hope this short guide gives you some ideas for visualizations you want to build. I look forward to seeing what you come up with.