Democratizing Data Science

Ushahidi
Mar 26, 2011

We like to say that our mission with the SwiftRiver project is to democratize access to the tools used for understanding information. To me, that means taking the hard work out of drawing insight from excessive quantities of data, to help humans process things more efficiently. That's why it was a huge honor to announce the SwiftRiver project's ongoing collaboration with software developer Pete Warden earlier this year. This week, Pete announced a cool project closely aligned with our mission, called the Data Science Toolkit:

The Data Science Toolkit is a collection of data tools and open APIs curated by our own Pete Warden. You can use it to extract text from a document, learn the political leanings of a particular neighborhood, find all the names of people mentioned in a text and more. He unveiled it today at GigaOM Structure Big Data in New York City. It's available as a Web service, or you can download a virtual machine and host it on your own server.

Street Address to Coordinates - Street Address to Location calculates the latitude/longitude coordinates for a postal address.

File to Text - Converts PDFs, Word Documents, Excel Spreadsheets to text. Recovers text from JPEG, PNG or TIFF images of scanned documents.

Coordinates to Political Areas - Returns the country, region, state, county, constituencies and neighborhood a point is inside.

Geodict - Geodict pulls country, city and region names from unstructured English text, and returns their coordinates.

IP Address to Coordinates - IP Address to Location calculates country, state, city and latitude/longitude coordinates for IP addresses.

Text to Sentences - Removes any parts of the text that look like boilerplate instead of real sentences.

HTML to Text - Returns the full text that would actually be displayed in the browser when an HTML document was rendered.

HTML to Story - Takes an HTML document representing a news article or similar page, and extracts just the story text.

Text to People - Spots text fragments that look like people's names or titles, and guesses their gender where possible.
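
To give a feel for how simple these tools are to call, here is a minimal sketch in Python that builds request URLs for two of them. The base URL and the endpoint names (street2coordinates, ip2coordinates) are assumptions based on the DSTK naming scheme, and build_dstk_url is a hypothetical helper, not part of the toolkit. Check the documentation at datasciencetoolkit.org, or point base_url at your own hosted virtual machine, before relying on any of it.

```python
from urllib.parse import quote

# Assumed base URL; swap in your own server if you host the DSTK virtual machine.
BASE_URL = "http://www.datasciencetoolkit.org"


def build_dstk_url(endpoint: str, query: str, base_url: str = BASE_URL) -> str:
    """Build a GET URL of the form <base>/<endpoint>/<percent-encoded query>.

    Hypothetical helper: the endpoint names passed in are assumptions,
    not something this function validates.
    """
    return f"{base_url.rstrip('/')}/{endpoint}/{quote(query, safe='')}"


if __name__ == "__main__":
    # Street Address to Coordinates (endpoint name assumed)
    print(build_dstk_url("street2coordinates", "1600 Pennsylvania Ave, Washington, DC"))
    # IP Address to Coordinates (endpoint name assumed)
    print(build_dstk_url("ip2coordinates", "8.8.8.8"))
```

The sketch only constructs the URLs. Fetching them, and parsing whatever the service returns, is left out because the response format is not described here.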

The DSTK project joins a number of similar open data science tools on the market. Increasingly, people of all types need to own and control their own data in ways that are easy to use and deploy. That's one of the reasons people use Ushahidi products: apps like ours lower the barrier to entry for those who want simple ways to collect or visualize data. It's also why we're actively contributing to Geodict and the greater DSTK initiative. Find out more at datasciencetoolkit.org.