Separating the Wheat from the Chaff

Ushahidi
Jan 27, 2010

The purpose of Swift River is to validate crowdsourced information in near real-time. In the past few weeks, Ushahidi has received nearly one hundred thousand incident reports related to Haiti, the majority of them coming from Twitter. Because crowdsourced information often has varying degrees of accuracy, an application is needed to validate this information in a timely manner, especially for Ushahidi. In order for us to map an incident, we need to know some basic things. Is the information true? Where is it coming from? Has it been submitted before? Has it been verified by someone other than the source itself? Is it actionable (meaning, does there need to be a follow-up response)? Is that response urgent, or can it be tempered? In other words, how do we filter the wheat from the chaff, the signal from the noise, in ways that allow us to respond to emergencies efficiently?

This is the problem Swift River will attempt to solve by acting as the stop-gap between the deluge of information created online and the Ushahidi crisis mapping platform. Many feeds come in; one 'sanitized' feed goes out. This saves us time and, in some cases, saves lives. There are three components to the first phase of Swift River's development:

1. Predictive Tagging

The first component focuses on automatically "tagging" text messages and tweets to describe an event using keywords parsed from the content itself. In some cases accompanying metadata will already include tags and a location, but for SMS (and often Twitter) this is rarely the case. Thus, if we have other methods of providing additional context, we can at the very least speed up an otherwise tedious process. For instance, with SULSa (our Swift User Location Service) we've automated the process of extracting latitude and longitude from IP addresses. A rough sketch of the keyword-matching step appears after this list.

2. Verification and Taxonomy

The second component crowdsources the veracity of this automatic tagging to ensure that it is correct. Users vote on tags, providing a model of accuracy that can be fed back into our algorithm. This 'learning' is critical to our improvement, as it ultimately speeds up the tagging process. Taxonomy simply refers to the sorting of this content based on what we've learned thus far. This allows us to begin sorting content based on applied tags, location, and relevance. The second sketch below illustrates this voting loop.

3. Filtering by Authority and Trust

The third component is a learned 'trust' algorithm that scores content sources based on user behavior. If we aggregate multiple sources, and content from one source is consistently deemed to be more accurate than content coming from other sources, we can begin assigning scores to these sources. We aren't interested in algorithms that rank individual content (Digg-like voting algorithms), but rather in rating the point of origin. Content (good or bad) tells us a great deal about the source itself, and trusted sources can be prioritized, even though all content is eventually reviewed. The third sketch below illustrates this scoring.
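
To make the tagging idea concrete, here is a minimal sketch in Python. The `CATEGORY_KEYWORDS` table and the `predict_tags` function are hypothetical illustrations of keyword matching, not Swift River's actual code; a real vocabulary would be far larger and refined over time.

```python
import re

# Hypothetical category -> keyword map; an illustration only.
CATEGORY_KEYWORDS = {
    "medical": {"doctor", "injured", "hospital", "medicine"},
    "trapped": {"trapped", "rubble", "collapsed"},
    "water": {"water", "thirsty", "dehydrated"},
}

def predict_tags(text):
    """Return candidate tags for a message based on simple keyword matches."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(tag for tag, keywords in CATEGORY_KEYWORDS.items()
                  if words & keywords)

print(predict_tags("People trapped under rubble near the hospital, need a doctor"))
# -> ['medical', 'trapped']
```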
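
The verification loop might be modeled along the following lines. The `TagVerification` class, its vote thresholds, and the smoothed score are assumptions made for illustration, not Swift River's actual model.

```python
class TagVerification:
    """Accumulate crowd votes on one automatically applied tag.

    Hypothetical sketch: thresholds and the Laplace-smoothed score
    are assumptions, not Swift River's real feedback model.
    """

    def __init__(self, confirm_at=0.8, reject_at=0.2):
        self.up = 0
        self.down = 0
        self.confirm_at = confirm_at
        self.reject_at = reject_at

    def vote(self, agrees):
        """Record one user's vote on whether the tag is correct."""
        if agrees:
            self.up += 1
        else:
            self.down += 1

    def score(self):
        # Smoothed so a tag with no votes starts at 0.5.
        return (self.up + 1) / (self.up + self.down + 2)

    def status(self):
        s = self.score()
        if s >= self.confirm_at:
            return "confirmed"
        if s <= self.reject_at:
            return "rejected"
        return "pending"

tag = TagVerification()
for agrees in (True, True, True, True, False):
    tag.vote(agrees)
print(round(tag.score(), 2), tag.status())  # 0.71 pending
```

The smoothing keeps a single early vote from confirming or rejecting a tag outright, which matches the idea that the 'learning' accumulates over many votes before it is reapplied to the tagger.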
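
Finally, a toy sketch of the trust idea: each source accumulates a reviewed/confirmed record, and the resulting score orders the review queue without removing anything from it. The `TrustLedger` name, the smoothing scheme, and the example handles are all hypothetical.

```python
from collections import defaultdict

class TrustLedger:
    """Score sources by how often their reports are confirmed accurate.

    Hypothetical sketch of rating the point of origin rather than
    individual pieces of content.
    """

    def __init__(self):
        self.confirmed = defaultdict(int)
        self.reviewed = defaultdict(int)

    def record_review(self, source, was_accurate):
        """Log the outcome of a human review of one report."""
        self.reviewed[source] += 1
        if was_accurate:
            self.confirmed[source] += 1

    def trust(self, source):
        # Smoothed so an unseen source starts at 0.5.
        return (self.confirmed[source] + 1) / (self.reviewed[source] + 2)

    def prioritize(self, reports):
        """Order reports so trusted sources surface first; nothing is
        dropped, since all content is still reviewed eventually."""
        return sorted(reports, key=lambda r: self.trust(r["source"]),
                      reverse=True)

ledger = TrustLedger()
ledger.record_review("@relief_ngo", True)
ledger.record_review("@relief_ngo", True)
ledger.record_review("@unknown_user", False)
queue = ledger.prioritize([
    {"source": "@unknown_user", "text": "bridge out on Route 2"},
    {"source": "@relief_ngo", "text": "clinic needs supplies"},
])
print([r["source"] for r in queue])  # ['@relief_ngo', '@unknown_user']
```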

These are the first milestones for realizing the Swift River platform. The second set of milestones will involve expanding the sources that Swift River parses to include full-text articles and sites like Flickr (photos) and YouTube (video). The third milestone (the next blog post in the series) will focus on clustering tags and determining implicit relationships between content from different sources. The fourth milestone will apply innovative techniques to visualize clusters and probability scores in compelling ways. And of course the fifth milestone is the incremental improvement of every component of our system. As we release builds of Swift we'll update this series accordingly, with the things we're working on and (once released) what users can expect next. If you're a developer interested in working with the Swift River team, you can reach us here.

Jon Gosier and Patrick Meier