A decade ago, let’s say you wanted to know the population of the metropolitan area of Accra, Ghana using the open web. For a quick answer, you might look at a Wikipedia article’s infobox on Accra.
There would be a number. Let’s say you used another language version, French Wikipedia. You might get a second number. With a search engine, a third number.
Each might be correct, contextually. Of course, people are born, die, move, and boundaries are being negotiated. Any population statistic is bound to be out-of-date from the moment it’s collected.
But the variation—and lag—in updates to data points like population across Wikipedias, not to mention elsewhere on the web, has frustrated open access and semantic web advocates working towards more machine-readable linked data online. The inconsistency isn't a matter of controversial records or out-of-date datasets; it's a problem of unlinked data.
Enter Wikidata, in 2012. A machine and human-readable linked knowledge base that straddles the best of both. Humans can edit. Machines can read.
Update the population of a city in Wikidata; insert the linked identifier into article pages, and bada-bing, bada-boom: when the linked database is updated, every page pulling that identifier updates too. The population of Accra, Ghana can be consistent no matter where you look.
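Under the hood, that single source of truth is one API call away. Here is a minimal sketch in Python (my own illustration, not official Wikimedia code) that reads Accra's population statements from the public wbgetclaims endpoint; Q3761 is the Wikidata item for Accra and P1082 is the population property:

```python
import requests

# Read Accra's population statement(s) directly from Wikidata.
# Q3761 = Accra, P1082 = population (mappings assumed for this sketch).
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbgetclaims",
        "entity": "Q3761",
        "property": "P1082",
        "format": "json",
    },
)
for claim in resp.json()["claims"].get("P1082", []):
    # Quantity values are stored as signed strings, e.g. "+1234567".
    print("Population statement:", claim["mainsnak"]["datavalue"]["value"]["amount"])
```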
Wikidata is a sister project to the better-known crowdsourced encyclopedia, Wikipedia—which has benefits for data journalists.
And both are part of the Wikimedia movement, whose mission is to bring "free educational content to the world."
But unlike Wikipedia, which at 20 years old is recognised for being surprisingly reliable despite repeated predictions of its demise, Wikidata is best known for having a promising future that hasn't quite arrived. Witness, for example, the fact that Accra's population is still unlinked on Wikipedia (the English Wikipedia article's infobox references the census, not Wikidata).
How can data journalists use the Wikimedia movement's linked knowledge base data?
The first step is discerning the difference between the promise of the project and how it works today. Elisabeth Giesemann, from Wikimedia Deutschland, recently gave a talk for journalists and explained that Wikidata is actualising the vision of a semantic web touted by Sir Tim Berners-Lee, creator of the world wide web.
Berners-Lee similarly champions Wikidata. He co-founded the Open Data Institute, which recently honoured Wikidata with a special award.
Though there’s evidence that Wikidata is already ushering in a new era of linked data—with the dataset being incorporated into commercial technologies such as Amazon's Alexa and Google's knowledge graph—there are limitations, starting with whether or not web pages link to Wikidata at all.
The project suffers from the biases and vandalism that plague other Wikimedia projects. Including gender gaps in the contributor base—the majority of the volunteer editors are male. And the majority of the data is from—and about—the Northern hemisphere. The project is young, Giesemann emphasises.
As a concept, one might compare Wikidata to a busy train station. There are millions of links between data points and interlinks to other open datasets.
Denny Vrandečić, who designed Wikidata and worked for Google for six years as an ontologist before joining the Wikimedia Foundation last summer, said Wikidata connects to 40,000 other databases, including Wikipedia, DBpedia, the Library of Congress, the German National Library, and VIAF.
“[It’s] a backbone of the web of data,” said Kat Thornton, a researcher at Yale University Library with expertise in linked data. “If you are interested in data, it is better than search. It would be oversimplifying Wikidata to call it search. [There] you are matching the string, Wikidata’s web of knowledge is far more powerful than string matching.”
How Wikidata is designed to work
Like other linked data projects, Wikidata models information using the Resource Description Framework (RDF). This model expresses data in semantic triples. Subject --> predicate --> object. For example, Accra is an instance of a city.
An item can be a definite object—a specific book, person, event, or place. Or a concept—transgender, optimism, or promise. Items are identified with a “Q” and a unique number.
Accra is Q3761. City is Q515. Predicates are labelled with a “P” and a number. Instance of is P31. Relationships between items are stored as statements. Accra is an instance of a city. Q3761-->P31-->Q515.
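To see what that triple looks like in the raw data, here is a rough Python sketch (one of several possible routes in) that pulls Accra's item JSON from the Special:EntityData endpoint and prints its instance-of statements:

```python
import requests

# Fetch the full JSON for an item and walk its "instance of" (P31) claims.
entity_id = "Q3761"  # Accra
data = requests.get(
    f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.json"
).json()
item = data["entities"][entity_id]

print(item["labels"]["en"]["value"])  # the English label, "Accra"
for claim in item["claims"].get("P31", []):
    # Each statement points at another item, e.g. Q515 (city).
    print("instance of:", claim["mainsnak"]["datavalue"]["value"]["id"])
```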
“Computers are often very fast, but generally very dumb, so in a system of explicit representation of information, you have to tell them everything. Tokyo is a city. The sky is up. Water is wet,” wrote Trey Jones, a software engineer for the Wikimedia Foundation, in a recent article on linked data.
However, some data points are factually contentious. When is someone dead? Most births and deaths are cut and dried—unless you are Terri Schiavo.
How about Taiwan? Is it an instance of a sovereign country, or a territory of another one? The same questions hang over Sudan, Palestine, Crimea, and Northern Ireland. The list goes on (and it does).
Helpfully, Wikidata allows for ambiguity. Taiwan is an instance of a country. Taiwan is an instance of a territory. Both statements can exist in the item.
There’s also room for references, which enable the reuse of linked data by improving error detection. But contributors can make statements without them.
This lowers barriers to entry for data donations and new contributions, but can mean that "inaccurate data, or messy database imports, such as peerage or vandalism, are a challenge," said Jim Hayes, a volunteer Wikidata contributor and Wikimedia D.C. member in an email interview.
As a result, the available linked data is uneven. Some organisations have already donated data.
There are 25,000 items for Flemish paintings in Wikidata, thanks to a 2015 data donation from a collective of Flemish museums and galleries. But other topics—such as the cultural heritage of nations in the Global South—are left wanting.
This is a topic Mohammed Sadat Abdulai, a co-lead of the non-profit organisation Art+Feminism and community communication manager with Wikimedia Deutschland, deals with daily.
He said in a recent phone call that Wikidata’s eurocentrism can materialise not only through the presence or absence of data, but also in the subtle way that data is modelled.
“If you come from a different way of thinking, you will find it is difficult to model your way of thinking with Wikidata,” he said. He gave the example of name etymologies. “There are Ghanaian names in Dagbani that are meaningful through their connection to days of the week,” said Abdulai. “Atani means a female born on a Monday. But this meaning is not easy to model using subclasses in Wikidata.”
Abdulai strives to expand representation of Ghana with Wikidata, but his experience suggests there can be linguistic ghettos. “It is a good thing Wikidata is flexible,” he said. “You can find your own way of modelling. But since it is not conventional, you end up working in your own little space.”
Probe differences in titles and coverage to understand possible regional differences and sources on the topic. This screenshare video uses the article on the 2020 women’s strike against Polish abortion law as an example. It demonstrates how to find the Wikidata link and access multiple language versions of a Wikipedia article.
Three ways data journalists can bring Wikidata into their data storytelling
1. As a shortcut between Wikipedia versions
One easy way to use Wikidata is as a node of connection between Wikipedias: the Wikidata item links the articles about the same topic across different language editions. This shortcut can help you check for variation in existing coverage, and pry for new angles.
For instance, the 2020 women's strike in Poland against the abortion law has articles in 17 different language Wikipedias, each providing slightly different coverage of the strike. To quickly dig into the variation across Wikipedias, and the narratives they index, use Wikidata to toggle between article versions.
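The same shortcut works programmatically. Below is a hedged Python sketch using the API's wbgetentities module to list an item's sitelinks, that is, the language editions that cover it; Q3761 (Accra) stands in for whichever item you are probing:

```python
import requests

# List the sitelinks attached to a Wikidata item: keys such as "enwiki" or
# "frwiki" are Wikipedia editions (others, like "commonswiki", are sister projects).
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbgetentities",
        "ids": "Q3761",           # swap in the item for the topic you're checking
        "props": "sitelinks",
        "format": "json",
    },
)
sitelinks = resp.json()["entities"]["Q3761"]["sitelinks"]
for site, link in sorted(sitelinks.items()):
    print(site, "->", link["title"])
```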
2. Use Wikidata at scale
The API is available and the interface is multilingual.
This requires caution, however. Barrett Golding can attest. When the pandemic hit last year, Golding—a former NPR producer and freelance data journalist—launched Iffy.news. Designed for researchers and journalists, the site contains indexes and lists on sources of mis/disinformation.
Golding uses large-scale data harvesting to showcase whether or not a website has a reputation for fact-checking, based on credibility rankings from databases such as Media Bias/Fact Check. “If we’re going to call other sites untrustworthy, we can’t just say ‘trust us’ as the reason why. So each Iffy site links to the failed fact-checks that make that site unreliable,” Golding explained.
More recently, he began accessing information from Wikidata and Wikipedia to cross-check the reliability of websites and online sources, thanks to a grant from WikiCred. (For disclosure, I am also working on a project on reliable sources and Wikipedia funded by WikiCred.)
That’s where things fell apart. The data were too piecemeal.
Infowars, a well-known instance of “fake news,” is described as such in its English Wikipedia article. Wikipedia editors have also blacklisted the website from being used as a reliable source in citations, according to the hand-updated list of Perennial sources.
But these classifications didn’t make it to Wikidata. Infowars was just an instance of news satire and a website. (That is, until two weeks ago, when the item was edited to add “fake news” as an instance-of value.)
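If you want to spot-check such classifications yourself before harvesting at scale, a short Python sketch like this one helps; it assumes the top search hit for a name is the right item, which you should verify by eye:

```python
import requests

API = "https://www.wikidata.org/w/api.php"

# Find a likely item for a name, then list how Wikidata classifies it (P31).
search = requests.get(API, params={
    "action": "wbsearchentities",
    "search": "InfoWars",
    "language": "en",
    "type": "item",
    "format": "json",
}).json()["search"]
item_id = search[0]["id"]  # assumption: the first hit is the item we mean

claims = requests.get(API, params={
    "action": "wbgetclaims",
    "entity": item_id,
    "property": "P31",  # "instance of"
    "format": "json",
}).json()["claims"].get("P31", [])

print(item_id, "instance of:",
      [c["mainsnak"]["datavalue"]["value"]["id"] for c in claims])
```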
The takeaway for data journalists? Be aware that large-scale data harvesting from Wikidata’s API can scrape out nuance at scale, rather than the other way around.
Leverage Wikidata's strengths
There are Wikidata storytelling success stories, which often involve using the dataset in conjunction with other data. Laura Jones (no relation to Trey or the author), a researcher with the Global Institute for Women’s Leadership at King's College London, authored a report that shows how women—journalists and experts—have been involved in coronavirus media coverage.
To find out, the study used Wikidata’s and Wikipedia’s APIs to identify the gender and occupation of 54,636 unique people mentioned in a vat of news content from the 2020 pandemic, sourced from the API of Event Registry, an AI-driven media intelligence platform.
Thanks to the information stored in Wikidata, Jones was able to identify most of the unique people, experts and journalists alike, mentioned in the news coverage. The majority of media coverage about the pandemic was written by male journalists, while only one in five expert voices interviewed about the pandemic was female, Jones concluded.
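The report doesn't publish its pipeline, but the general approach can be sketched in a few lines of Python: match a name to a Wikidata item, then read its gender (P21) and occupation (P106) statements. The helper below is my own illustration of that idea, not the study's code:

```python
import requests

API = "https://www.wikidata.org/w/api.php"

def person_properties(name):
    """Look up a name on Wikidata and return its gender (P21) and occupation (P106).
    Assumes the first search hit is the right person -- check before trusting it."""
    hits = requests.get(API, params={
        "action": "wbsearchentities", "search": name,
        "language": "en", "type": "item", "format": "json",
    }).json()["search"]
    if not hits:
        return None
    item_id = hits[0]["id"]
    entity = requests.get(API, params={
        "action": "wbgetentities", "ids": item_id,
        "props": "claims", "format": "json",
    }).json()["entities"][item_id]

    def values(prop):
        # Collect the item IDs each statement points at, skipping "no value" snaks.
        return [c["mainsnak"]["datavalue"]["value"]["id"]
                for c in entity["claims"].get(prop, [])
                if "datavalue" in c["mainsnak"]]

    return {"item": item_id, "gender": values("P21"), "occupation": values("P106")}

print(person_properties("Anthony Fauci"))
```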
Science Stories.io uses Wikidata and other linked data projects to visualise stories about women in science and academia. By aggregating images, structured data, and prose at scale, Science Stories.io generates hundreds of multimedia biographical portraits of historical and contemporary notable women.
Scholia, meanwhile, pulls data from Wikidata to create visual profiles of items including chemicals, species, and people.
3. Discover relationships through Wikidata's query service
Get a sense of what’s in Wikidata, and how this may aid your data storytelling, through querying. The Wikidata query service is free and available online. You’ll need to use SPARQL, a query language for linked data that will look familiar to anyone who has written SQL.
Whether you are already familiar with SPARQL or just getting started, there is an abundance of tutorials and training videos to learn from. The query service also has examples. Run an example yourself for fun by pressing the blue “play” button. There are also volunteer users who are willing to run queries for you.
You can modify examples or write a query from scratch. Query results can be visualised, shared, downloaded, or embedded. It's worth running a query before you use the API or download a data dump.
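The query service can also be called programmatically, which is handy once you have a query you trust. A minimal sketch in Python, asking for Accra's population (wd:Q3761, wdt:P1082) from the public SPARQL endpoint:

```python
import requests

# Run a SPARQL query against the Wikidata Query Service endpoint.
query = """
SELECT ?population WHERE {
  wd:Q3761 wdt:P1082 ?population .   # Accra's population statement(s)
}
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "wikidata-sketch/0.1 (example for a story)"},
)
for row in resp.json()["results"]["bindings"]:
    print("Population:", row["population"]["value"])
```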
When it comes to an effort like Golding’s project on “fake news” websites, the query could be the first red flag that the data just isn’t there. For instance, a query for instances of “fake news” websites in Wikidata reveals fewer than a dozen.
Try it—and keep in mind the results from your query will be as of the date you run the query, not as of mine, nor as of Golding’s. (Right now, there’s no way to share a hyperlink to a historical version of a query.) Part of the problem, as Golding found, is idiosyncratic classification. Some items are instances of websites, others of online newspapers.
Another query example is a “Timeline of Death by Burning.” (Try it). I modified the query by substituting the cause of death (P509) from burning (Q468455) to decapitation (Q204933). (Try it). Both rendered grisly timelines showcasing a long history of these particular forms of death.
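For readers who would rather see the shape of such a query than click through, this is roughly what the timeline query looks like as SPARQL inside a Python string; the comment marks where the cause-of-death item gets swapped, and the date-of-death property (P570) is my own addition for ordering the timeline:

```python
# Roughly the "Timeline of Death by Burning" query: swap Q468455 (burning)
# for Q204933 (decapitation) to reproduce the modification described above.
timeline_query = """
SELECT ?person ?personLabel ?date WHERE {
  ?person wdt:P509 wd:Q468455 .                 # cause of death: burning
  OPTIONAL { ?person wdt:P570 ?date }           # date of death, for the timeline
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?date
"""
# Run it against https://query.wikidata.org/sparql as in the previous sketch.
```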
My next modification reveals a limitation of the dataset. I wanted to create a timeline of women murdered in gender-based violence. This has a name: femicide (Q1342425). Again, grisly, I know—but also important. I expected the number might be lower than reality. But I was not prepared for a timeline of one. One femicide. (Try it).
Liam Wyatt, who manages the Wikicite programme for the Wikimedia Foundation, said this is a typical pitfall. “You have to caveat any query result with 'as far as Wikidata knows,'” he explained in a phone interview.
It’s possible there are other femicides documented in Wikidata, categorised as instances of homicide or domestic violence. But much remains invisible. For instance, the murder of Pınar Gültekin by her ex-boyfriend last year made headlines around the world. Women took to the streets in Turkey to protest.
And there was a much-debated social media hashtag campaign, #challengeaccepted, to raise awareness about femicide.
While there is a Wikipedia article in English about the murder, and a Wikidata item for the event, Gültekin—not to mention the manner of her death—is not included as a human in Wikidata.
"Wikidata is now powerful and important, but still esoteric and incomplete,” said Wyatt, on the ambiguities of the current state of Wikidata. “It’s a bit of a wild west. Journalists who can get in on the ground floor, on this wave while it’s still picking up speed, they will really be in a position to ride the momentum.”
A promising project, when we remember that promise, according to Wikidata, is also known as liability. That is, to be “held morally or legally responsible for action or inaction.” Use it freely, and be mindful not to substitute this dataset for news judgement.
As Last Moya writes in “Data Journalism in the Global South”, data can aid journalism in speaking truth to power provided “journalistic agency and not data is King.”
Thanks to Molly Brind'Amour (University of Virginia), Will Kent (Wiki Education Foundation), Lane Rasberry (University of Virginia), and Houcemeddine Turki (Wikimedia Tunisia) for speaking with me for background research for this story.