

Separating the Wheat from the Chaff

The purpose of Swift River is to validate crowdsourced information in near real-time. In the past few weeks, Ushahidi has received nearly one hundred thousand reports of incidents related to Haiti, the majority of them coming from Twitter. Because crowdsourced information varies in accuracy, an application is needed to validate this information in a timely manner, especially for Ushahidi. In order for us to map an incident, we need to know some basic things. Is the information true? Where is it coming from? Has it been submitted before? Has it been verified by someone other than the source itself? Is it actionable (meaning, does there need to be a follow-up response)? Is that response urgent or can it wait?

In other words, how do we filter the wheat from the chaff, the signal from the noise, in ways that allow us to respond to emergencies efficiently? This is the problem Swift River will attempt to solve by acting as the stop-gap between the deluge of information created online and the Ushahidi crisis mapping platform. Many feeds come in, one ‘sanitized’ feed goes out. This saves us time and, in some cases, saves lives.

There are three components to the first phase of Swift River’s development:

1. Predictive Tagging

The first component focuses on automatically “tagging” text messages and tweets to describe an event using keywords parsed from the content itself. In some cases accompanying meta information will already include tags and location, but for SMS (and often Twitter) this is rarely the case. Thus, if we have other methods of providing additional context, we can at the very least speed up an otherwise tedious process. For instance, with SULSa (our Swift User Location Service) we’ve automated the process of extracting latitude and longitude from an IP address.
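
To make the idea of parsing keywords from the content concrete, here is a minimal sketch in Python. It is not Swift River's actual tagger, which this post does not describe. It assumes a naive approach (count the non-stopword tokens and suggest the most frequent ones), and the names `STOPWORDS` and `suggest_tags` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "are", "at", "for", "in", "is", "of", "on", "the", "to"}


def suggest_tags(message, max_tags=3):
    """Return the most frequent non-stopword tokens in `message` as candidate tags."""
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    # Drop stopwords and very short tokens, which rarely describe an event.
    keywords = [t for t in tokens if t not in STOPWORDS and len(t) > 2]
    # Keep the `max_tags` most common remaining tokens.
    return [word for word, _ in Counter(keywords).most_common(max_tags)]


if __name__ == "__main__":
    print(suggest_tags("Water needed in Jacmel, no water at the hospital"))
```

Candidate tags produced this way are only suggestions; they still need the human verification described in the next component.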

2. Verification and Taxonomy

The second component crowdsources verification of this automatic tagging to ensure that it is correct. Users vote on tags, providing a model of accuracy that can be reapplied to our algorithm. This ‘learning’ is critical to our improvement as it ultimately speeds up the tagging process. Taxonomy simply refers to the sorting of this content based on what we’ve learned thus far. This allows us to begin sorting content based on applied tags, location and relevancy.
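
A rough sketch of how votes on tags could be turned into an accuracy figure follows. The post does not say how Swift River models accuracy, so this assumes a simple smoothed ratio of confirming votes to all votes; the class `TagVotes`, its methods, and the 0.6 threshold are all hypothetical.

```python
from collections import defaultdict


class TagVotes:
    """Tally user votes on suggested tags (illustrative only)."""

    def __init__(self):
        # tag -> [confirming votes, rejecting votes]
        self._votes = defaultdict(lambda: [0, 0])

    def vote(self, tag, correct):
        """Record one user's judgment that `tag` was correct or incorrect."""
        self._votes[tag][0 if correct else 1] += 1

    def accuracy(self, tag):
        """Smoothed share of votes confirming `tag`; 0.5 for a tag with no votes."""
        up, down = self._votes.get(tag, (0, 0))
        return (up + 1) / (up + down + 2)

    def accepted(self, threshold=0.6):
        """Tags whose accuracy meets the (assumed) threshold, sorted by name."""
        return sorted(t for t in self._votes if self.accuracy(t) >= threshold)


if __name__ == "__main__":
    votes = TagVotes()
    for correct in (True, True, True, False):
        votes.vote("water", correct)
    votes.vote("earthquake", False)
    print(votes.accepted())
```

The smoothing keeps a tag with a single vote from jumping straight to 0 or 1, which matters when only a few users have weighed in.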

3. Filtering by Authority and Trust

The third component is a learned ‘trust’ algorithm that scores content sources based on user behavior. If we aggregate multiple sources, and content from one source is consistently deemed to be more accurate than content coming from other sources, we can begin assigning scores to these sources. We aren’t interested in algorithms that rank individual content (Digg-like voting algorithms) but rather in rating the point of origin. Content (good or bad) tells us a great deal about the source itself and trusted sources can be prioritized, even though all content is eventually reviewed.
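
The source-scoring idea can be sketched as follows. This is an assumption-laden illustration rather than Swift River's trust algorithm: it scores each source by the smoothed fraction of its items that reviewers judged accurate, then uses that score only to order the review queue. The class `SourceTrust` and the feed names are hypothetical.

```python
from collections import defaultdict


class SourceTrust:
    """Score sources by how often their content was judged accurate (illustrative only)."""

    def __init__(self):
        # source -> [items judged accurate, items judged inaccurate]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, source, accurate):
        """Record a reviewer's verdict on one item from `source`."""
        self._counts[source][0 if accurate else 1] += 1

    def score(self, source):
        """Smoothed accuracy rate; a source with no history scores 0.5."""
        ok, bad = self._counts.get(source, (0, 0))
        return (ok + 1) / (ok + bad + 2)

    def prioritize(self, items):
        """Order (source, content) pairs so trusted sources come first.

        Nothing is dropped: every item stays in the queue for review.
        """
        return sorted(items, key=lambda item: self.score(item[0]), reverse=True)


if __name__ == "__main__":
    trust = SourceTrust()
    for _ in range(3):
        trust.record("feed_a", True)
    for _ in range(2):
        trust.record("feed_b", False)
    queue = [("feed_b", "report 1"), ("feed_c", "report 2"), ("feed_a", "report 3")]
    for source, content in trust.prioritize(queue):
        print(source, content)
```

Note that the score attaches to the source, not to any single report, which is the distinction drawn above from Digg-like voting on individual items.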


These are the first milestones for realizing the Swift River platform. The second set of milestones will involve expanding the sources that Swift River parses to include full text articles and sites like Flickr (photos) and YouTube (video). The third milestone (next blog post in the series) will focus on clustering tags and determining implicit relationships between content from different sources. The fourth milestone will apply innovative techniques to visualize clusters and probability scores in compelling ways. And of course the fifth milestone is the incremental improvement of every component of our system.

As we release builds of Swift we’ll update this series accordingly, with the things we’re working on and (once released) what users can expect next. If you’re a developer interested in working with the Swift River team you can reach us here.

Jon Gosier and Patrick Meier

Posted in Development, SwiftRiver.

20 Responses


  1. Hello
    It is interesting to read about those three components of Swift River’s development. There is good information about the components. Thanks for this post.

  2. If you strategically hire 2 people (they could be volunteers) and they “review” Twitter feeds/SMS at the rate of 120 per minute, they can get through your 100K reports in 2 days.

    {[(120 * 60) per hour * 8] per day} * 2 days == 115,200

  3. While I see a number of people are working on dealing with the quality of information received, what I have found to date seems to be focused upon misinformation and filtering for action prioritising.

    I am wondering about the issue of intentional misinformation.

    In many instances, there are entities with a vested interest in preventing valid information regarding things such as voting, battles and even disasters, both natural and man-made.

    For nearly any human effort, there exists a group of entities which would profit by either the details or the extent of a problem being kept from the public–and that can include relief agencies.

    While tracking particular sources and the validity of their reports is a step in the right direction, some entities, in particular governments and large corporations, have access to the resources needed to generate thousands or even hundreds of thousands of false data reports, flooding the system with misinformation.

    I have no answers, and I’m not certain of the correct questions, but we can be certain that there is no limit to the pain, suffering and death that some entities are willing to inflict in pursuit of their own ends, and thus we need to be able to deal with such situations if/when they arise.

  4. Great comments, Charles. However, we’ve not only anticipated this, our platform is very much being designed for such scenarios. With Swift, we aren’t just validating content, we’re also validating users: users validate each other and content validates users. Content can also be used to verify other content. This creates a system that’s difficult to dupe, as anyone looking to falsify information would need to submit thousands of false reports from a number of different ‘users’, locations, and media channels.

    What would be absolutely possible is for a group to download Swift, set up their own instance with all sorts of fake information and publicize it as fact. However, our distributed, decentralized reputation system River ID would show that outside of that instance’s ‘ecosystem’ no one trusts those users, or the instance. If the administrators opt out of tracking…they also forfeit any sort of benefits that come from River ID (trust from users who don’t know you or your site). In this case falsifying information is indeed easy, but promoting it becomes self-defeating, as the more people who aren’t under your influence see it, the less authority your Swift instance (with all its fake reports) actually holds.

    Here’s some more reading on this very subject – http://irevolution.wordpress.com/2010/04/08/wag-the-dog/

  5. A very promising concept. Software that learns with each message it receives by using filters to improve the accuracy of the articles and messages found online.

  6. Ken said

    Automated software could help you or harm you. With proper input and testing, it could be done. There would still need to be some human input. Just ask Google.

  7. George said

    Your comments and concept are interesting; however, I feel the old computer saying “Garbage In, Garbage Out” may come into play here. If the original sources are skewed then the ultimate result will be as well. Look at the difference in the slant of the news presented by the major news networks. They may all report the same info from eyewitnesses and respected news gathering agencies; however, when it gets broadcast the same event can be explained in a totally different light.

  8. R4 said

    Very interesting read. I must say that it really stopped me and made me think.

Continuing the Discussion

  1. How much will it cost for me to get a volvo key made with the chip … | Volvo Automotive Marque linked to this post on 27 January 2010

    [...] Separating the Wheat from the Chaff – The Ushahidi Blog [...]

  2. Ronin Automotive Blog : Ronin Cars Automotive Blog | Dodge Automotive Marque linked to this post on 27 January 2010

    [...] Separating the Wheat from the Chaff – The Ushahidi Blog [...]

  3. Tweets that mention Separating the Wheat from the Chaff – The Ushahidi Blog -- Topsy.com linked to this post on 27 January 2010

    [...] This post was mentioned on Twitter by ushahidi, ushahidi, meedan, Maja A, MirelaMonte and others. MirelaMonte said: RT @ushahidi: Excellent explanation of how Swift will help separate 'Wheat from the Chaff' in emerging flows of information http://ow.ly/10Tfo via @meedan [...]

  4. Haiti: Where Are and Where We Go From Here – The Ushahidi Blog linked to this post on 29 January 2010

    [...] are doing this manually right now, but with Swift River in the works, we hope to be able to validate crowdsourced information in near real time using [...]

  5. » Software der redder liv - blogs.berlingske.dk linked to this post on 3 February 2010

    [...] can influence the information that is spread, but that sort of thing can be accounted for in the program (the next generation is called Swift River). And there are problems with getting the word out about the possibility: It helps [...]

  6. Using Mechanical Turk to Crowdsource Humanitarian Response « iRevolution linked to this post on 6 February 2010

    [...] their observations which would further help triangulate the veracity of the evaluation à la Swift River. Note that the Diaspora could also get involved in this. And like txteagle, statistical machinery [...]

  7. Recession Shopping: Tips For Designer Baby Clothes | Baby Care Information & Advice linked to this post on 13 February 2010

    [...] Separating the Wheat from the Chaff – The Ushahidi Blog [...]

  8. HIVOS funding and Ushahidi – The Ushahidi Blog linked to this post on 16 February 2010

    [...] focused on getting information into the system, and volunteer teams of crisis mappers gathered in situation rooms in Boston, DC and Geneva to “crowdsource the filter” and make sense of the mountain of incoming [...]

  9. From Netsourcing to Crowdsourcing to Turksourcing Crisis Information « iRevolution linked to this post on 16 March 2010

    [...] can be disaggregated into human intelligence tasks (HITs) combined with some automation, like Swift River. And none of this would require prior [...]

  10. Ushahidi Twitter Intelligence Tool Released linked to this post on 26 March 2010

    [...] How do you filter it in a way that saves time, without sacrificing accuracy? This is the problem SwiftRiver is attempting to solve. Share Innovation code, haiti, swift, swiftriver, Twitter, ushahidi [...]

  11. Visualizing Redundant Data Validation – The Ushahidi Blog linked to this post on 9 May 2010

    [...] the SwiftRiver platform. They are in part a response to this comment from reader Charles Bernard on this post. His comment: In many instances, there are entities with a vested interest in preventing valid [...]

  12. Visualizing Redundant Data Validation by Jon Gosier (Ushahidi) « surflightroy linked to this post on 10 May 2010

    [...] the SwiftRiver platform. They are in part a response to this comment from reader Charles Bernard on this post. His comment: In many instances, there are entities with a vested interest in preventing valid [...]
