Orientations to Wikipedia often begin with its enormity. And it is enormous. The encyclopedia will be 20 years old in January 2021 and has more than 53 million articles in 314 languages. Six million are in English. According to Alexa.com, Wikipedia is the 8th most-visited web domain in the United States, and the 13th globally; it’s the only non-profit in the top-100 domains. In November 2020, more than 1.7 billion unique devices from around the world accessed Wikipedia articles. Average monthly pageviews surpass 20 billion.
Beyond reach, there’s the data. All data on and about all Wikipedias—from pageview statistics and most-frequently cited references to every version ever written and every editor who has ever contributed—is freely available. Entire version histories are available at dumps.wikimedia.org.
Thanks to free and open access to billions of human- and machine-readable data points, corporations and research centres have been leveraging Wikipedia for research for years. Benjamin Mako Hill, assistant professor of communication at the University of Washington, and Aaron Shaw, associate professor of communication at Northwestern University, describe Wikipedia as the “most important laboratory for social scientific and computing research in history” in their chapter in "Wikipedia@20", a new book on Wikipedia published by MIT Press and edited by Joseph Reagle and Jackie Koerner.
“Wikipedia has become part of the mainstream of every social and computational research field we know of,” Hill and Shaw write. Google’s knowledge graph and smart AI technologies, such as Amazon’s Alexa and Google Home, are based on metadata from Wikimedia projects, of which Wikipedia is the best-known. Significant for data journalists is how Wikipedia’s influence has already surpassed clicks to article pages; in a way, the internet is already Wikipedia’s world; we’re just living in it.
But journalists know well that ubiquity shouldn’t stand in for universality. We should be mindful that indiscriminate use of “big data” without acknowledging context reproduces what Joy Buolamwini, founder of the Algorithmic Justice League, calls the “coded gaze” of white data. Safiya Umoja Noble, a critical information studies expert and associate professor at UCLA, challenges the acceptance of invisible values that normalise algorithmic hierarchies.
Internet search results, which often prioritise Wikipedia articles in addition to using Wikipedia’s infobox data or structured data in sidebars, “feign impartiality and objectivity in the process of displaying results,” Noble writes in "Algorithms of Oppression: How Search Engines Reinforce Racism".
Systemic biases on Wikipedia, including well-documented “gaps” in coverage, readership, and sourcing, are cause for pause. Globally, volunteer contributors are predominantly white men from the northern hemisphere. On English Wikipedia, fewer than 20% of editors self-identify as female. These asymmetries in participation have shaped editorial processes and content. Editors who self-identify as women often perform “emotional work” to justify their contributions. Women and nonbinary users on Wikipedia may encounter hostile, violent language, and some have experienced harassment and doxing. Then there are the asymmetries in the breadth and depth of coverage: only approximately 17% of biographies on English Wikipedia are about women.
How to contribute to Wikipedia
Anyone can edit Wikipedia, but there is an editorial pecking order and policies to keep in mind. Tips for success:
Once you have created an account, be sure to include a bio on your user page (you don't need to use your real name, but you can).
Begin by improving existing articles; you can create new articles once your account is four days old and you’ve made ten edits.
Include verifiable citations to secondary sources for any new claims, or for existing claims where a citation is needed.
Be aware of Wikipedia’s guidelines on conflicts of interest.
Beyond this, there are many tutorials and videos with various tips and tricks. Among them, this is a useful high-level summary, while an editing tutorial hosted by the Wikimedia Foundation walks you through nitty-gritty basics.
With this glut of imperfect or missing data, what’s a data journalist to do? Journalists doing internet research might consider that they are already knee-deep in a minefield of constraints.
“The reality for journalists working on the internet is fraught,” said Hill. “Most internet data sets are controlled by commercial companies. That means there’s never going to be a full data set and what’s available has been—or is being—manipulated. Wikipedia is different. It’s free, it’s accessible, and it’s from a public service organisation.” Like any institution, as Catherine D’Ignazio has pointed out in this publication, context may be hard to find. On Wikipedia, that’s usually due to the decentralised organisation of open source projects, where volunteers come and go, rather than to intentional obfuscation.
Nevertheless, Noam Cohen, a journalist for Wired and The New York Times who has written about Wikipedia for nearly two decades, said in a phone interview that journalists should—if they are not already—use Wikipedia’s data, including pageviews and the layers of information found in article pages. But Cohen cautions journalists not to let Wikipedia’s decisions on coverage replace news judgement. “In journalism, word length is often a sign of importance,” Cohen said. “That’s not the case on Wikipedia: there are articles about "The Simpsons" or characters on "Lost" that are longer than articles about important women scientists or philosophers. But these trends don’t mean there are not rules. There are; the information is changing.”
Last year, Cohen’s editor asked him to write about why his Wikipedia biography—which he did not create; there are guidelines barring “conflict of interest editing”—was deleted. Cohen dug in and discovered it was due to “sock-puppetry”, shorthand for editors who use more than one account without disclosure. Later, another editor restored Cohen’s biography.
Stories like this may give journalists discomfort about the contingencies of the online encyclopedia, and any data sets therein. And for as long as there’s been Wikipedia, there have been editors and professors warning us to stay away. But Cohen suggests thinking otherwise. “The fact that information is slowly being changed and is always saved is Wikipedia’s superpower,” said Cohen. To leverage Wikipedia’s superpowers for data journalism, it’s best to climb into the belly of the beast.
Understand how Wikipedia’s authority works
While one might reasonably guess that the Wikimedia Foundation manages editorial oversight, that’s not the case. All content decisions, including developing and managing the bots that do tedious, repetitive tasks—fixing redirects or reverting vandalism, as ClueBot_NG does—are designed and run by volunteers. The Wikipedia community has developed a number of policies and guidelines to govern editing, including a rule about verifiability and a blacklist of publications not allowed to be cited on Wikipedia. Blacklisted publications include spam outlets and publications that do not fact-check or that circulate conspiracy theories.
In 2017, Katherine Maher, executive director of the Wikimedia Foundation, spoke with The Guardian about the volunteer community’s decision to blacklist The Daily Mail as a reliable source. “It’s amazing [Wikipedia] works in practice,” she said, nodding to a concept that academics have called peer production, or crowdsourcing. “Because in theory it is a total disaster.” Wikipedia works in practice, not in theory: it’s a popular idiom among Wikipedians, as Brian Keegan writes in Wikipedia@20. And it does suggest there’s something magical about the project, where successful shared editing of a single document was happening long before Google Docs.
There is a logic to Wikipedia—no magic. The free encyclopedia launched in 2001 for “anyone” to edit. This was not an explicit democratic effort to engage portions of the public who have historically been left out of structures of power, though some have championed Wikipedia for coming close to achieving this. Rather, it was a wildcard reversal of Wikipedia’s failed predecessor, Nupedia, which was designed as a free, peer-reviewed encyclopedia edited by recognised experts. When editing shifted from experts to “anyone”—that is, people who happened to have computers, internet connections, a penchant for online debate, and familiarity with MediaWiki, as opposed to busy academic experts—contributions flowed faster.
Wikipedia was also a product of its time. It was one of many online encyclopedia projects in the early 2000s. Under Section 230 of the 1996 Communications Decency Act in the United States, Wikipedia, like other platforms then and now, has been immune from legal liability for its contents. Section 230 also gives platforms the legal blessing to govern as they see fit. Jimmy Wales, co-founder of Wikipedia, set up the Wikimedia Foundation to oversee the project and its sister platforms in 2005, and the project has remained volunteer-run. The Wikimedia Foundation has an endowment of more than $64 million, with tech titans such as Amazon pledging millions, and the Foundation supports projects by volunteers and affiliates. English Wikipedia has snowballed in popularity on a commercial internet. Google, for instance, prioritises Wikipedia articles in search results—treating them like “gospel,” said Cohen—while the convenience, currency, and comprehensibility of Wikipedia attract regular readers.
Using pageviews to tell a story
Data journalists can find granular pageview data handy for storytelling. Wikipedia’s readers come from around the world. The Wikimedia Foundation does not track individuals, but it does count devices across pages, including which type of device—mobile app, mobile browser, or desktop browser—is used to access each page. This can give journalists insight into topical and regional access trends.
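The same data is available programmatically through the Wikimedia Pageviews REST API. Below is a minimal sketch; the article title, date range, and contact address are illustrative choices, and swapping "all-access" for "mobile-app", "mobile-web", or "desktop" gives the device breakdown mentioned above.

```python
# A minimal sketch: fetch daily pageviews for one article from the
# Wikimedia Pageviews REST API (title and dates are examples).
import requests

# Wikimedia asks clients to send a descriptive User-Agent.
HEADERS = {"User-Agent": "data-journalism-demo/0.1 (contact: you@example.com)"}

def daily_pageviews(article, start, end, project="en.wikipedia"):
    """Return (date, views) pairs for an article, human readers only."""
    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"{project}/all-access/user/{article}/daily/{start}/{end}"
    )
    resp = requests.get(url, headers=HEADERS)
    resp.raise_for_status()
    return [(item["timestamp"], item["views"]) for item in resp.json()["items"]]

for date, views in daily_pageviews("COVID-19_pandemic", "20201001", "20201130"):
    print(date, views)
```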
More radically, pageviews can reveal kernels of stories yet to be broken. Let’s simulate research using pageviews for a story on the rising COVID-19 case count in light of concerns about circulation of misinformation and disinformation on the virus. Digging into pageview data on COVID-19 articles in English Wikipedia can help to tell this story, and others like it.
In spring 2020, as unprecedented economic and social changes unfolded across the globe, journalists were at the forefront of covering the moment. Conspiracy theories were gaining visibility in social media groups, while edit counts and information queries for articles related to COVID-19 were at their highest to date.
By mid-November 2020, a new trend had emerged. Positive cases of COVID-19 skyrocketed around the globe. Several European countries and U.S. states re-introduced lockdown measures to slow the spread of the virus. But Wikipedia pageviews for articles about COVID-19 were not rising; in fact, they were lower than earlier in the year. Meanwhile, pageviews for the presidential candidates and their families were cresting with the U.S. election.
Did election coverage distract readers from the pandemic? Spikes in readership on Wikipedia are often the consequence of other media attention or events, which could help to explain the peaks in views for George Floyd, Donald Trump, and Joe Biden. Koerner, who trained as a social scientist, cautions journalists not to make quick deductions about readers' motivations from high-level pageview data. “It’s tricky to say that pageviews are indicative of what people are thinking,” she said. For more granularity, journalists can compare sets of pageviews using the browser-based pageview visualisation tool.
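For scripted comparisons, here is a minimal sketch using the same Pageviews API at monthly granularity; the article titles and the 2020 date range are illustrative, not defaults of any tool.

```python
# A sketch comparing pageview trends across several articles: for each
# title, print the month with the most views to see where attention went.
import requests

HEADERS = {"User-Agent": "data-journalism-demo/0.1 (contact: you@example.com)"}
API = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/all-access/user/{title}/monthly/20200101/20201231")

for title in ["COVID-19_pandemic", "Donald_Trump", "Joe_Biden"]:
    resp = requests.get(API.format(title=title), headers=HEADERS)
    resp.raise_for_status()
    items = resp.json()["items"]
    peak = max(items, key=lambda item: item["views"])
    # Timestamps look like "2020110100"; the first six digits are YYYYMM.
    print(f"{title}: peak {peak['views']:,} views in {peak['timestamp'][:6]}")
```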
Meanwhile, pageviews of the general COVID-19 article may have peaked in the spring, but data journalists can note that pageviews of the article “Symptoms of the coronavirus” rose in October, as depicted above, before case numbers peaked. Incidentally, this correlation could lend credence to a 2014 suggestion by a team of epidemiologists that high pageview counts for influenza-related Wikipedia articles could be used to predict the percentage of Americans with influenza. While it remains to be seen whether pageviews can predict illness spikes, the data can offer a wide lens on the zeitgeist.
Behind the scenes
With approximately 300 edits per minute—which is soothing to listen to—Wikipedia is always changing. You may already have edited Wikipedia; the blue “edit” tab is on almost every article page. There are more than 1.2 billion speakers of English and over 40 million Wikipedia accounts.
Maybe you made an account and your changes stuck. Maybe you tried to write an article, only to have it deleted. Or maybe you wondered about how easy it is to add profanity to an article on a popular topic—only to realise that the “Edit” tab is missing. Rather, there’s a lock. Or possibly, a gold star.
Locks. Gold stars. Deletions. These are subtle signs and signals that can help you understand how the editing community works.
Wikipedia’s “best” articles are marked with green plus signs and gold stars; these are Good and Featured content, which have undergone “peer review.” They are a minority among Wikipedia's millions: just 0.1%.
Meanwhile, the active editorial community on English Wikipedia numbers about 4,000 editors a month. Fewer still are administrators. As of November 2020, approximately 1,100 users have successfully undergone a “request for adminship” and been granted additional technical privileges, including the ability to delete and/or protect pages. Non-administrative editors, however, may still patrol new pages and roll back recent changes.
Wikipedia’s editorial judgement can spark justified outrage.
Journalist Stephen Harrison covered this recently in his Slate article on the Theresa Greenfield biography. Archivists and indigenous and feminist communities have noted that the reliable source guidelines exclude oral histories, ephemera, and special collections. (I am currently co-leading an Art+Feminism research project on marginalised communities and reliable source guidelines, funded by WikiCred, which supports research, software projects, and Wikimedia events on information reliability and credibility.) Data journalists can follow debates on-wiki, and note what is absent, by looking at articles’ Talk and View history tabs, and at the noticeboards for deletion and reliable sources.
At the same time, there’s plenty to be discovered with Wikipedia. Article features such as wikilinks, citations, and categories can help data journalists quickly access a living repository of information.
In 2011, an editor began a list documenting people killed by law enforcement in the United States, both on duty and off duty. Since 2015, the annual average number of justifiable homicides reported has been estimated at near 930. Tables about gun violence have been collected on Wikipedia for nearly a decade.
The integrity of this list was brought to my attention by Jennifer 8. Lee, a former New York Times journalist, who expressed surprise that there are not more examples of journalists using Wikipedia’s data. Lee would know: she co-founded the U.S.-based Credibility Coalition and MisinfoCon, and supports WikiCred, which addresses credibility in online information and includes Wikipedians, technologists, and journalists.
“[These] are fascinating and useful,” said Lee. “Not automated, this is a hand-written list. It’s all in one place. This is useful for journalists and those of us in the credibility sphere to use it for research.”
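As a sketch of how a journalist might pull such a hand-built list into a dataframe for analysis: the article title below is an assumption of the list's current name, its table layout may change, and pandas.read_html needs lxml or html5lib installed.

```python
# A hedged sketch: load the tables from a Wikipedia list article into
# pandas DataFrames. Check the live page for its current title and
# column layout before relying on any field names.
import io

import pandas as pd
import requests

HEADERS = {"User-Agent": "data-journalism-demo/0.1 (contact: you@example.com)"}
URL = ("https://en.wikipedia.org/wiki/"
       "List_of_killings_by_law_enforcement_officers_in_the_United_States")

html = requests.get(URL, headers=HEADERS).text
tables = pd.read_html(io.StringIO(html))  # one DataFrame per <table>
print(f"Found {len(tables)} tables")
print(tables[0].head())
```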
Ed Erhart, who works with the Wikimedia Foundation’s audience engagement team, suggests that articles can be not only a repository but also fodder for coverage. “I like to say that there is a story in every Wikipedia article,” he wrote by email, drawing my attention to a Featured article about a small town, Arlington, Washington. “Who wrote it? Where are they from? What motivated them? The talk and history tabs on Wikipedia's pages can be the starting point for some truly unique takes on local places and issues.”
Quick links
More about page protection
More about user access levels
More about featured articles and lists
Icon for featured articles is a star
Icon for good articles is a green circle with a plus sign
Catching malfeasance
Data journalists can follow edits to track corporate or governmental malfeasance. Article pages about companies or politicians can be scrubbed to omit negative information, even though editors are required to disclose conflicts of interest on their user page or on the article's talk page.
Not all contributors disclose. Kaylea Champion, a doctoral student at the University of Washington, led a large-scale research project on IP editing and discovered systematic deletions from mining articles. Anonymous editors removed information about environmental contamination and abuse. Champion and her co-authors traced the IP addresses that deleted the incriminating information to the headquarters of the mining companies.
Journalists can do their own large-scale reconstructions of edit histories using data from Wikipedia’s data dumps, or manually browse pages of interest. Historical contributions can all be accessed, even when they are no longer visible on the live page. Journalists can also reach out to editors by leaving a note on their Talk page with information on how to connect.
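As a concrete starting point, here is a minimal sketch that pulls recent revision metadata for one page from the MediaWiki API; the article title and revision limit are examples.

```python
# A minimal sketch: fetch an article's recent revision metadata from the
# MediaWiki API. Each entry includes the editor's username (or IP, for
# anonymous edits), a timestamp, and the edit summary.
import requests

HEADERS = {"User-Agent": "data-journalism-demo/0.1 (contact: you@example.com)"}

params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Black Lives Matter",
    "rvprop": "timestamp|user|comment",
    "rvlimit": 20,  # newest 20 revisions; paginate for a full history
    "format": "json",
}
resp = requests.get("https://en.wikipedia.org/w/api.php",
                    params=params, headers=HEADERS)
resp.raise_for_status()
page = next(iter(resp.json()["query"]["pages"].values()))
for rev in page["revisions"]:
    print(rev["timestamp"], rev["user"], "-", rev.get("comment", ""))
```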
The GIF below demonstrates how to access View History and compare versions of the Black Lives Matter article page. Use the View History tab to compare version histories; you can also click on a timestamp to view that version of the article in full.
Quick links
Data dumps with complete copies of all Wikipedias
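If you download one of these dumps, here is a hedged sketch of streaming through the decompressed XML with Python's standard library; the filename is a placeholder, and the schema namespace varies by dump version.

```python
# A hedged sketch: stream through a (decompressed) Wikipedia XML history
# dump, counting revisions per page. iterparse keeps memory flat by
# discarding each <page> element once it has been processed.
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # check your dump's version
DUMP = "enwiki-pages-meta-history.xml"  # placeholder path

for event, elem in ET.iterparse(DUMP, events=("end",)):
    if elem.tag == NS + "page":
        title = elem.findtext(NS + "title")
        revisions = elem.findall(NS + "revision")
        print(title, len(revisions))
        elem.clear()  # free memory before moving to the next page
```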
Tracking with bots
Bots can help with tracking. In 2014, volunteers launched a number of bots that track edits made from specific IP ranges and post the findings to Twitter. Parliament WikiEdits, one of the first, still regularly tweets edits made from Parliamentary IPs in the UK. Similar efforts have covered The White House, the European Union, the Norwegian Parliament, the German Parliament, Goldman Sachs, and Monsanto, though not all are up to date.
For data journalists interested in setting up a bot that tweets about anonymous Wikipedia edits from particular IP address ranges in their beat, the code is available from Ed Summers on GitHub under a CC0 license.
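For a sense of the mechanics (this is not Summers' code), here is a hedged sketch that watches Wikimedia's public EventStreams feed of recent changes and flags anonymous edits from a watched IP range; the CIDR block is a documentation-reserved example, not a real institution's range.

```python
# A hedged sketch in the spirit of those bots: follow the public
# recent-changes stream and print anonymous edits from a given IP range.
import ipaddress
import json

import requests

HEADERS = {"User-Agent": "data-journalism-demo/0.1 (contact: you@example.com)"}
WATCHED = ipaddress.ip_network("203.0.113.0/24")  # example range only
STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"

with requests.get(STREAM, headers=HEADERS, stream=True) as resp:
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip SSE comments and "event:"/"id:" fields
        change = json.loads(line[len(b"data: "):])
        user = change.get("user", "")
        try:
            ip = ipaddress.ip_address(user)  # anonymous editors appear as IPs
        except ValueError:
            continue  # registered username, not an IP
        if ip in WATCHED:
            print(f"{user} edited {change.get('title')} "
                  f"on {change.get('server_name')}")
```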
Pitfalls to avoid: steering clear of media manipulation
Summers created @CongressEdits in 2014, which tweeted IP contributions from U.S. Capitol computers. The Wikipedian reported that “Twitter-addicted journalists” were soon mining the bots for story ideas, some of which did reveal manipulation, such as an attempt to water down the entry on CIA torture. @CongressEdits amassed a growing audience. Things came to a head in 2018, when a former Democratic staffer (who was later arrested) with access to U.S. Capitol computers inserted personal information into Wikipedia articles about Republican members of the Senate Judiciary Committee. The Twitter account automatically shared those details with its large following, and Twitter banned the bot as a result.
People can intentionally game the editorial system or the interconnections between Wikipedia and other social media platforms. Data journalists should weigh the public benefit of reporting against the risk of amplifying hate speech, harassment, or vandalism, which can be a form of coded language. “Why are people editing articles to say that the [mainstream political party] is [name of radical, violent party]? They want the screenshot,” Cohen remarked. “The best way to get a lie into [the] mainstream is to edit an article, let Google pick it up, and get reporting on it. It’s probably a thrill to plant them.”
Furthermore, Wikipedia has no “real name” policy for editors. Some choose to disclose personal details on user pages, which can help gain the confidence of other editors, but this is not required. Thus, manipulators can mimic the behaviour patterns of a group to blend in.
Joan Donovan, director of Technology and Social Change at Harvard Kennedy School’s Shorenstein Center, calls this a “butterfly attack.” Once the fakes are indistinguishable to outsiders from legitimate accounts, the manipulators push contentious issues to divide and delegitimise the group. Be mindful that you are not also falling for a “butterfly attack”—or perpetuating one by accidentally characterising editors as occupying one particular position over another. Instead, get to know the communities behind the data to minimise harm.
If you discover vandalism or hate speech in a page history, consider the impact of covering material that has since disappeared from the live page. Be mindful of the extent to which an effort at public service can double as publicity or exposure for people sympathetic to fringe ideologies or violence. Reporters who stumble across data on hate speech might report on it in aggregate, without identifying particular details, to minimise harm.
Pro tips for navigating Wikipedia:
Get to know Wikipedia’s editorial process and community before reporting on hate speech or harassment
Strongly consider the newsworthiness of articles that might give publicity to fringe ideologies
Use data in aggregate to avoid revealing details
Circular reporting
In 2007, The Independent published an article on Sacha Baron Cohen that included a line saying he had previously worked as an investment banker. Days earlier, the claim had appeared, unverified, on Wikipedia. Later, The Independent’s article became the citation for the erroneous claim.
None of it was true. Wikipedia editors call incidents like this “citogenesis”, or circular reporting. There is even a Wikipedia article that compiles known instances. The Techdebug blog traced the Baron Cohen example, with the good advice to “pay attention to timelines” when reviewing the sources of claims on Wikipedia. When using facts from Wikipedia, trust but verify.
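One way to pay attention to timelines, sketched under assumptions: fetch a window of an article's revision history, with content, from the MediaWiki API and find the oldest fetched revision containing the claim. The title and phrase echo the Baron Cohen example; a real check would paginate through the full history rather than stop at 50 revisions.

```python
# A hedged sketch: find roughly when a phrase first appeared in an
# article, to compare against the cited source's publication date.
import requests

HEADERS = {"User-Agent": "data-journalism-demo/0.1 (contact: you@example.com)"}

params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Sacha Baron Cohen",
    "rvprop": "timestamp|content",
    "rvslots": "main",
    "rvlimit": 50,  # API max when fetching content; paginate for more
    "format": "json",
}
resp = requests.get("https://en.wikipedia.org/w/api.php",
                    params=params, headers=HEADERS)
resp.raise_for_status()
page = next(iter(resp.json()["query"]["pages"].values()))

phrase = "investment banker"  # the claim being traced (example)
earliest = None
for rev in page["revisions"]:  # ordered newest to oldest
    text = rev["slots"]["main"].get("*", "")
    if phrase in text:
        earliest = rev["timestamp"]  # last match seen is the oldest
print("Oldest fetched revision containing the phrase:", earliest)
```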
With close attention to detail and context, data journalists can use Wikipedia’s trove of data to elucidate stories of the digital landscape. “Wikipedia is more than the sum of its parts” said Cohen. “Random encounters are often more compelling than the articles themselves. The search for information resembles a walk through an overbuilt quarter of an ancient capital. You circle around topics on a path that appears to be shifting. Ultimately, the journey ends and you are not sure how you got there.”
Thanks to Mohammed Sadat Abdulai (Art+Feminism, Wikimedia Deutschland), Ahmed Median (Hacks/Hackers), and Kevin Payravi (WikiCred, Wikimedia D.C.) for taking time to interview with me for background research for this story.