Beyond reliability: An ethnographic study of Wikipedia sources

Ushahidi
Aug 7, 2012

Also at ethnographymatters.net

Almost a year ago, I was hired by Ushahidi to work as an ethnographic researcher on a project to understand how Wikipedians managed sources during breaking news events. Ushahidi cares a great deal about this kind of work because of a new project called SwiftRiver, which seeks to collect and enable the collaborative curation of streams of data from the real-time web about a particular issue or event. If another Haiti earthquake happened, for example, would there be a way for us to filter out the irrelevant and the misinformation, and build a stream of relevant, meaningful and accurate content about what was happening for those who needed it? And on Wikipedia's side, could the same tools be used to help editors curate a stream of relevant sources as a team rather than as individuals?

Original designs for voting a source up or down in order to determine "veracity"

When we first started thinking about the problem of filtering the web, we naturally thought of a ranking system that would rank sources according to their reliability or veracity. The algorithm would consider a variety of variables involved in determining accuracy, as well as whether sources had been chosen or voted up or down by users in the past, and would eventually be able to suggest sources according to the subject at hand. My job would be to determine what those variables were: what were editors looking at when deciding whether or not to use a source?

I started the research by talking to as many people as possible. Originally I was expecting to conduct 10-20 interviews as the focus of the research, finding out how those editors went about managing sources individually and collaboratively. The initial interviews enabled me to hone my interview guide. One of my key informants urged me to ask questions about sources not cited as well as those cited, leading me to one of the key findings of the report: that the citation is often not the actual source of information, and is often provided in order to appease editors who may complain about sources located outside the accepted Western media sphere. But I soon realized that the editors with whom I spoke came from such a wide variety of experience, work areas and subjects that I needed to restrict my focus to a particular article in order to get a comprehensive picture of how editors were working.

I chose the 2011 Egyptian revolution article because I wanted a globally relevant breaking news event that would have editors from different parts of the world working together on an issue with local expertise located in a language other than English. Using Kathy Charmaz's grounded theory method, I chose to focus on editing activity (in the form of talk pages, edits, statistics and interviews with editors) from the 25th of January, 2011, when the article was first created (within hours of the first protests in Tahrir Square), to the 12th of February, when Mubarak resigned and the article changed its name from '2011 Egyptian protests' to '2011 Egyptian revolution'.

After reviewing the big-picture analyses of the article using Wikipedia statistics on top editors, locations of anonymous editors and so on, I started with an initial coding of the actions taking place in the text, asking the question 'What is happening here?' I then developed a more limited codebook using the most frequent and significant codes, and proceeded to compare different events with the same code (looking up relevant edits of the article in order to get the full story) and to look for the tacit assumptions that the actions left out. I did all of this coding in Evernote because it seemed the easiest (and cheapest) way of importing large amounts of textual and multimedia data from the web, but it wasn't ideal: talk pages need to be re-formatted when imported, and I ended up coding the data in a single column, since putting each talk page conversation into its own cell would have been too time-consuming.

Screenshot of my Evernote desktop showing initial coding

I then moved to writing a series of thematic notes on what I was seeing, trying to understand, through writing, what the common actions might mean. I finally moved to the report writing, bringing together what I believed were the most salient themes into a description and analysis of what was happening according to the two key questions that the study was asking: how do Wikipedia editors, working together, often geographically distributed and far from where an event is taking place, piece together what is happening on the ground and then present it in a reliable way? And how could this process be improved?

Ethnography Matters has a great post by Tricia Wang about how ethnographers contribute (often invisible) value to organizations by showing what shouldn't be built, rather than necessarily improving a product that already has a host of assumptions built into it. And so it was with this research project: I realized early on that a ranking system conceptualized this way would be inappropriate, for the simple reason that, along with characteristics for determining whether a source is accurate or not (such as whether the author has a history of presenting accurate news articles), there are a number of important variables that are independent of the source itself. On Wikipedia, these include the number of secondary sources in the article (Wikipedia policy calls for editors to use a majority of secondary sources), whether the article covers a breaking news story (in which case the majority of sources might have to be primary, eyewitness sources), and whether the source is notable in the context of the article (misinformation can also be relevant if it is widely reported and significant to the course of events, as Judith Miller's NYT stories were for the Iraq War).

"The Source of the Trouble", New York Magazine, 21 May, 2005 cited in Judith Miller's Wikipedia article

This means that you could have an algorithm for determining how accurate a source has been in the past, but whether you make use of the source or not depends on factors relevant to the context of the article that have little to do with the reliability of the source itself. Another key finding recommending against source ranking is that Wikipedia's authority originates from its requirement that each potentially disputed phrase be backed up by reliable sources that readers can check, whereas source ranking necessarily requires that the calculation be invisible in order to prevent gaming. It is already a source of potential weakness that Wikipedia citations are not the original source of information (since editors often choose citations that will be deemed more acceptable to other editors), so further hiding how sources are chosen would undermine this important value. On the other hand, having editors provide a rationale behind the choice of particular sources, and showing the full variety of sources consulted rather than only those kept after citations are trimmed for page-loading reasons, may be useful, especially since these discussions do often take place on talk pages but are practically invisible because they are difficult to find.

Analysing the talk pages of the 2011 Egyptian revolution article enabled me to understand how Wikipedia editors set about the task of discovering, choosing, verifying and summarizing sources, and adding information to and editing the article. Through the rather painstaking study of hundreds of talk pages, it became clear how editors were:

a) storing discovered articles, either in their own editor domains by putting relevant articles into categories, or by alerting other editors to breaking news on the talk page;
b) choosing sources by finding at least two independent sources that corroborated what was being reported, but then removing some of the citations as the page became too heavy to load;
c) verifying sources by finding sources to corroborate what was being reported, by checking what the summarized sources contained, and/or by waiting to see whether other sources corroborated what was being reported;
d) summarizing by taking screenshots of videos and inserting captions (for multimedia), or by choosing the most important events of each day for a growing timeline (for text);
e) adding text to the article by choosing how to reflect the source within the article's categories and providing citation information; and
f) editing by disputing the way that editors reflected information from various sources, and by replacing primary sources with secondary sources over time.

Wikipedia editor aude's method for storing relevant articles during the initial days of the Egyptian revolution

It was important to discover the work process that editors were following, because any tool that assists with source management would have to accord as closely as possible with the way that editors like to do things on Wikipedia. Since the process is managed by volunteers, and since volunteers decide which tools to use, this is critical to the acceptance of new tools. (A rough sketch of this workflow, in code, follows the figure below.)

Sources work process for breaking news
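As a rough model of that pipeline, the six steps might look like the following. This is my own sketch under assumed names and states, not an existing tool or anything editors actually use.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NewsSource:
    url: str
    is_primary: bool
    corroborations: List[str] = field(default_factory=list)
    summary: str = ""
    stage: str = "stored"  # a) flagged on the talk page or in an editor's own space

def choose(src: NewsSource, independent_reports: List[str]) -> None:
    """b) keep a source once at least two independent sources say the same thing."""
    src.corroborations = independent_reports
    if len(independent_reports) >= 1:  # the source itself plus one other = two
        src.stage = "chosen"

def verify(src: NewsSource) -> None:
    """c) check what the source contains, or wait for further corroboration."""
    if src.stage == "chosen" and src.corroborations:
        src.stage = "verified"

def summarize(src: NewsSource, text: str) -> None:
    """d) caption a video still, or pick the day's key events for the timeline."""
    src.summary = text
    src.stage = "summarized"

def add_to_article(src: NewsSource) -> str:
    """e) reflect the source in the article text and emit its citation."""
    src.stage = "added"
    return f"{src.summary}<ref>{src.url}</ref>"

def edit_over_time(sources: List[NewsSource]) -> List[NewsSource]:
    """f) replace primary sources with secondary ones as they become available."""
    secondary = [s for s in sources if not s.is_primary]
    return secondary or sources  # keep primaries only while nothing better exists
```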

After developing a typology of sources and isolating different types of Wikipedia source work, I made two sets of recommendations:

1. For designers to experiment with exposing the variables that are important for determining the relevance and reliability of individual sources, as well as the reliability of the article as a whole.
2. To provide a trail of documentation by replicating the work process that editors follow (somewhat haphazardly at the moment), so that each source is provided with an independent space for exposition and verification, and so that editors can collect breaking news sources collectively.

Regarding a ranking system for sources, I'd argue that a descriptive repository of major media sources from different countries would be incredibly beneficial, but that a system for ranking sources by usage would yield very limited results (we know, for example, that the BBC is the most used source on Wikipedia by a wide margin, but that doesn't necessarily help editors choose a source for a breaking news story). Exposing the variables used to determine relevance (rather than adding them up with invisible weights to come up with a magical number) and showing the progression of sources over time offer some opportunities for innovation. But this requires developers to think outside the box about what sources (beyond static texts) look like, where such sources and expertise are located, and how trust is garnered in the age of Twitter. The full report provides details of the recommendations and the findings, and will be available soon.
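A hedged sketch of what the first recommendation could mean in practice, with variable names that are entirely my own invention: surface the raw, named variables for editors to weigh for themselves, instead of collapsing them into one opaque score.

```python
def source_report(source: dict, article: dict) -> dict:
    """Return raw, named variables for editors to weigh, not one combined score."""
    return {
        "independent_corroborations": len(source.get("corroborations", [])),
        "is_primary": source.get("is_primary", False),
        "rationale_given_on_talk_page": bool(source.get("rationale")),
        "article_secondary_ratio": article.get("secondary_ratio", 0.0),
        "article_is_breaking_news": article.get("breaking_news", False),
    }

# Example: an eyewitness video shared during the first days of the protests.
report = source_report(
    {"is_primary": True,
     "corroborations": ["url-1", "url-2"],
     "rationale": "eyewitness footage, discussed on the talk page"},
    {"secondary_ratio": 0.3, "breaking_news": True},
)
print(report)  # editors see the *why*, not a single magic number
```

The design choice here mirrors the finding above: because the variables are exposed rather than summed with hidden weights, the system stays checkable by readers and editors in the same way Wikipedia's citations are.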

There are a variety of independent variables that determine whether or not sources are used

This is my first comprehensive ethnographic project, and one of the things I've noticed, having done other design and research projects using different methodologies, is that although the process can seem painstaking, and it can prove difficult to articulate hundreds of small noticings into findings that are actionable and meaningful to designers, getting close to the lived experience of editors is extremely important and valuable work that is rare in Wikipedia research. I realize now that, before I actually studied an article in detail, I knew very little about how Wikipedia works in practice. And this is only the beginning!