Thinking about Data, Fast and Slow

[Image: map of the top ten busiest air travel routes in 2012]

Last week, an interesting map was making the rounds on Twitter and in the tech community. The visualization, created by Vizual Statistix, used flight data from the air travel intelligence company Amadeus to depict the top ten busiest air travel routes in 2012. Click on the map above to see the full version.

The map was popular because it was counter-intuitive: of the top ten routes, none are in Europe or North America. Instead, the routes are clustered around East Asia, with others in South Africa, Australia, and Brazil.

When I first heard about the map, my immediate thought was “So South Africa has busier air travel than the US?” Based on some reactions on social media, I was not alone. However, that is not what the data is telling us.

Air travel is a network, with flights creating links between airports. The number of links among the airports in a region determines the network’s density there and, importantly, that density varies geographically. In other words, some areas of the world (e.g. Europe) have more flights to more places than other areas (e.g. Eastern Siberia).

Imagine you and 99 friends separately book flights between two airports in a very dense part of this air travel network, say between New York and Chicago. You will have a plethora of flight options, from direct flights to flights with layovers in Baltimore, Boston, Cleveland, Atlanta, and so on. Thus, you and your 99 friends will more than likely distribute yourselves over many of those different air routes, based on price, airline loyalty, departure time, and other factors.

Now imagine that you and 99 friends separately book flights between two cities in a less dense area of the air network, say between Sydney and Cape Town. You’d quickly notice that many more of your friends fly the same route. In fact, you and your friends will likely all end up having a layover in Johannesburg. Why? Because Cape Town is at the fringes of the air travel network, and so your route options are limited.

Looking at it this way, the air travel route map tells us something different. These are not the busiest routes because air travel in those areas is more popular. Instead, these are the busiest routes because they are choke points in the global air network. Cape Town is a popular destination, but to get there you have to go through Johannesburg. Osaka is a major city, but you have little choice but to go through Tokyo. On the flip side, the reason the US and Europe do not appear on the map is that their air travel networks are dense: people trying to get somewhere have many routes to choose from.
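The choke-point argument can be sketched with a toy simulation. This is only an illustration, not a model of the Amadeus data: the itinerary lists are hypothetical, and it assumes each traveler picks one of the available itineraries uniformly at random. It counts how many of the 100 travelers end up on each flight leg.

```python
import random
from collections import Counter

def leg_traffic(itineraries, travelers=100, seed=0):
    """Count passengers per flight leg when each traveler picks one
    itinerary uniformly at random (a simplifying assumption)."""
    rng = random.Random(seed)
    traffic = Counter()
    for _ in range(travelers):
        for leg in rng.choice(itineraries):
            traffic[leg] += 1
    return traffic

# Hypothetical itineraries for a dense part of the network:
# one direct flight plus one-stop options through several hubs.
hubs = ["Baltimore", "Boston", "Cleveland", "Atlanta"]
dense_itineraries = [[("New York", "Chicago")]] + [
    [("New York", hub), (hub, "Chicago")] for hub in hubs
]

# Hypothetical itinerary for a sparse part of the network:
# everyone is funneled through the same layover.
sparse_itineraries = [
    [("Sydney", "Johannesburg"), ("Johannesburg", "Cape Town")]
]

dense = leg_traffic(dense_itineraries)
sparse = leg_traffic(sparse_itineraries)

busiest_dense = dense.most_common(1)[0]
busiest_sparse = sparse.most_common(1)[0]

print("Busiest leg, dense network: ", busiest_dense)
print("Busiest leg, sparse network:", busiest_sparse)
```

In the sparse case every traveler lands on the same legs, so the busiest leg carries all 100 passengers. In the dense case the same 100 travelers are spread over several legs, so no single leg looks as busy even though total demand is identical.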

Why am I telling you this? Because at Ushahidi we deal with data — gathering it, cleaning it, managing it, and most important of all, understanding it. Without understanding, there can be no impact, and around here we are all about creating impact.

So over the years we’ve learned to not rush to conclusions with data — whether crowdsourced election monitoring or citizen feedback — but to think carefully about it: how variables are defined, what assumptions are made, and how it is collected. We’ve seen that by approaching data with a critical eye, we can discover the amazing stories waiting to be told.

One Response to “Thinking about Data, Fast and Slow”

  1. Great description of the complexities of data work. It is worth saying that the temporality of the data is also very important. The VS air network map uses simplistic summary stats (total number of passengers for the year 2012) rather than avg/median passengers per flight, per route, between regions, etc. So the network nodes (airports), edges (flights), and edge weights (numbers of flights) are all part of the equation. However, large cities are both more populous and more dense, so they are better connected, but most have a number of airports, including some in nearby regions far away from the metro area. The VS air route map is flat (one-dimensional), meaning that it is a snapshot of a period of time based on a single metric (sum of passengers) and apparently at the city level. Air travel is largely time-based, so a better way to visualise it might involve something like a heat map based on median numbers of passengers per day by region, not city/metro.