The purpose of Swift River is to validate crowdsourced information in near real-time. In the past few weeks, Ushahidi has received nearly one hundred thousand reports of incidents related to Haiti, the majority of them coming from Twitter. Because crowdsourced information varies in accuracy, an application is needed to validate it in a timely manner, especially for Ushahidi. In order for us to map an incident, we need to know some basic things. Is the information true? Where is it coming from? Has it been submitted before? Has it been verified by someone other than the source itself? Is it actionable (meaning, does there need to be a follow-up response)? Is that response urgent, or can it be tempered?
In other words, how do we filter the wheat from the chaff, the signal from the noise, in ways that allow us to respond to emergencies efficiently? This is the problem Swift River will attempt to solve by acting as the stop-gap between the deluge of information created online and the Ushahidi crisis mapping platform. Many feeds come in, one ‘sanitized’ feed goes out. This saves us time and, in some cases, saves lives.
There are three components to the first phase of Swift River’s development:
1. Predictive Tagging
The first component focuses on automatically “tagging” text messages and tweets to describe an event using keywords parsed from the content itself. In some cases the accompanying meta information will already include tags and location, but for SMS (and often Twitter) this is rarely the case. Thus, if we have other methods of providing additional context, we can at the very least speed up an otherwise tedious process. For instance, with SULSa (our Swift User Location Service) we’ve automated the process of extracting latitude and longitude from an IP address.
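To make the idea of parsing keywords from the content concrete, here is a minimal sketch in Python. It is not Swift River’s tagging code: the function name `suggest_tags`, the tiny stopword list, and the choice of plain word frequency as the ranking signal are all assumptions made for illustration.

```python
import re
from collections import Counter

# Illustrative stopword list (an assumption); a real system would use a fuller one.
STOPWORDS = {
    "a", "an", "and", "are", "at", "for", "from", "in", "is", "it",
    "near", "of", "on", "the", "to", "was", "we", "with",
}


def suggest_tags(text, max_tags=5):
    """Return up to max_tags candidate tags, ranked by word frequency."""
    # Lowercase the message and pull out alphabetic words.
    words = re.findall(r"[a-z']+", text.lower())
    # Ignore stopwords and very short words, then count what is left.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(max_tags)]


if __name__ == "__main__":
    message = "Water needed in Jacmel. Bridge collapsed near Jacmel, water supply cut."
    print(suggest_tags(message, max_tags=2))
```

Frequency counting is only the simplest possible signal. It does show why short messages are hard to tag: an SMS rarely repeats a word, so additional context such as location has to come from elsewhere.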
2. Verification and Taxonomy
The second component crowdsources the veracity of this automatic tagging to ensure that it is correct. Users vote on tags, providing a model of accuracy that can be reapplied to our algorithm. This ‘learning’ is critical to our improvement as it ultimately speeds up the tagging process. Taxonomy simply refers to the sorting of this content based on what we’ve learned thus far. This allows us to begin sorting content based on applied tags, location and relevancy.
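A rough sketch of how votes on tags could be turned into an accuracy figure follows. The `TagVotes` class, the 0.7 threshold, and the minimum of three votes are hypothetical choices for illustration, not values taken from Swift River.

```python
from collections import defaultdict


class TagVotes:
    """Tally user votes on whether an automatically applied tag was correct."""

    def __init__(self):
        # tag -> [votes agreeing with the tag, votes disagreeing with it]
        self._votes = defaultdict(lambda: [0, 0])

    def vote(self, tag, agrees):
        self._votes[tag][0 if agrees else 1] += 1

    def accuracy(self, tag):
        """Fraction of voters who agreed with the tag, or None if nobody voted."""
        agree, disagree = self._votes.get(tag, (0, 0))
        total = agree + disagree
        return agree / total if total else None

    def confirmed(self, threshold=0.7, min_votes=3):
        """Tags with enough votes and a high enough agreement rate, sorted by name."""
        result = []
        for tag, (agree, disagree) in self._votes.items():
            total = agree + disagree
            if total >= min_votes and agree / total >= threshold:
                result.append(tag)
        return sorted(result)
```

The per-tag accuracy is the kind of feedback that could be fed back into the tagger, for example by suppressing keywords that users keep rejecting.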
3. Filtering by Authority and Trust
The third component is a learned ‘trust’ algorithm that scores content sources based on user behavior. If we aggregate multiple sources, and content from one source is consistently deemed to be more accurate than content coming from other sources, we can begin assigning scores to these sources. We aren’t interested in algorithms that rank individual content (Digg-like voting algorithms) but rather in rating the point of origin. Content (good or bad) tells us a great deal about the source itself, and trusted sources can be prioritized, even though all content is eventually reviewed.
These are the first milestones for realizing the Swift River platform. The second set of milestones will involve expanding the sources that Swift River parses to include full text articles and sites like Flickr (photos) and YouTube (video). The third milestone (next blog post in the series) will focus on clustering tags and determining implicit relationships between content from different sources. The fourth milestone will apply innovative techniques to visualize clusters and probability scores in compelling ways. And of course the fifth milestone is the incremental improvement of every component of our system.
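Scoring the point of origin rather than individual items can be sketched as below. This is an illustration under stated assumptions, not the actual trust algorithm: the `SourceTrust` class is hypothetical, and the smoothed ratio (accurate + 1) / (total + 2) is a choice made here so that an unknown source starts at a neutral 0.5.

```python
from collections import defaultdict


class SourceTrust:
    """Score sources by how often their content was judged accurate."""

    def __init__(self):
        # source -> [items judged accurate, items judged in total]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, source, accurate):
        counts = self._counts[source]
        counts[0] += 1 if accurate else 0
        counts[1] += 1

    def score(self, source):
        """Smoothed accuracy ratio; a source with no history scores 0.5."""
        accurate, total = self._counts.get(source, (0, 0))
        return (accurate + 1) / (total + 2)

    def prioritize(self, items):
        """Order (source, content) pairs so trusted sources are reviewed first.

        Nothing is dropped: every item stays in the queue for eventual review.
        """
        return sorted(items, key=lambda item: self.score(item[0]), reverse=True)
```

Note that `prioritize` only reorders the queue. That matches the point above that trusted sources are handled first while all content is still reviewed eventually.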
As we release builds of Swift we’ll update this series accordingly, with the things we’re working on and (once released) what users can expect next. If you’re a developer interested in working with the Swift River team you can reach us here.
Jon Gosier and Patrick Meier


18 Responses
Hello
It is interesting to read about those three components of Swift River’s development. There is good information about the components. Thanks for this post.
If you strategically hire 2 people (they could be volunteers) and they “review” Twitter feeds/SMS at the rate of 120 per minute, they can get through your 100K reports in 2 days.
[(120 * 60) per hour * 8 hours per day] * 2 days = 115,200
While I see a number of people are working on dealing with the quality of information received, what I have found to date seems to be focused upon misinformation and filtering for action prioritising.
I am wondering about the issue of intentional misinformation.
In many instances, there are entities with a vested interest in preventing valid information regarding things such as voting, battles and even disasters, both natural and man-made.
For nearly any human effort, there exists a group of entities which would profit by either the details or the extent of a problem being kept from the public, and that can include relief agencies.
While tracking particular sources and the validity of their reports is a step in the right direction, some entities, in particular governments and large corporations, have access to the resources needed to generate thousands or even hundreds of thousands of false data reports, flooding the system with misinformation.
I have no answers, and I’m not certain of the correct questions, but the amount of pain, suffering and death that some entities are willing to cause in pursuit of their own ends seems to be limitless, and thus we need to be able to deal with such situations if/when they arise.
Great comments, Charles. However, we’ve not only anticipated this, our platform is very much being designed for such scenarios. With Swift, we aren’t just validating content, we’re also validating users: users validate each other and content validates users. Content can also be used to verify other content. This creates a system that’s difficult to dupe, as anyone looking to falsify information would need to submit thousands of false reports from a number of different ‘users’, locations, and media channels.
What would be absolutely possible is for a group to download Swift, set up their own instance with all sorts of fake information and publicize it as fact. However, our distributed, decentralized reputation system River ID would show that outside of that instance’s ‘ecosystem’ no one trusts those users, or the instance. If the administrators opt out of tracking…they also forfeit any sort of benefits that come from River ID (trust from users who don’t know you or your site). In this case falsifying information is indeed easy, but promoting it becomes self-defeating, as the more people who aren’t under your influence see it, the less authority your Swift instance (with all its fake reports) actually holds.
Here’s some more reading on this very subject – http://irevolution.wordpress.com/2010/04/08/wag-the-dog/
A very promising concept. Software that learns with each message it receives by using filters to improve the accuracy of the articles and messages found online.
Automated software could help you or harm you. With proper input and testing, it could be done. There would still need to be some human input. Just ask Google.