APIs for journalism
Conversations with Data: #36
There’s a three-letter acronym you’ve probably seen referenced in newsroom methodology documents, or listed as a data source option on various websites: API.
But what does API mean? And why should journalists take notice?
The acronym stands for ‘Application Programming Interface’. APIs take many forms, but journalists most commonly use them to query and pull data from a website’s internal database, perhaps to power a data visualisation or to populate a news app.
Despite these use cases, our research for this edition revealed that many of you still haven’t experimented with APIs in your reporting...for now at least. To get you inspired, here are six ways to use APIs for journalism.
How you’ve used APIs
1. To analyse social media data
Twitter’s APIs were by far the most commonly used by the journalists we spoke to. Whether it’s to perform an issue-based sentiment analysis or to mine comments by politicians, these APIs offer a rich pool of data to query.
Take this example from Aleszu Bajak, Editor of Storybench:
“Together with journalism + data science graduate student Floris Wu, I performed an analysis and wrote a piece for Roll Call that used Twitter data to disprove the mantra ‘When they go low, we go high’ uttered by Democrats in the lead-up to the 2018 midterm elections. I used a sentiment dictionary and R while Floris used a Fast.ai sentiment model and Python to arrive at our results -- which they let us publish as a scatterplot built almost entirely using R's ggplot2 package! We used Twitter's API for this project – accessed through Mike Kearney's easy-to-use rtweet package.”
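The dictionary-based side of that approach can be sketched in Python (Aleszu worked in R with rtweet; the tiny lexicon and sample tweets below are invented for illustration — real analyses use scored dictionaries such as AFINN with thousands of words):

```python
import re

# Toy sentiment lexicon: word -> score (invented for illustration)
LEXICON = {"great": 2, "win": 1, "low": -1, "corrupt": -3, "failing": -2}

def sentiment_score(tweet):
    """Sum the lexicon scores of all words found in a tweet."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return sum(LEXICON.get(w, 0) for w in words)

# Invented sample tweets standing in for data pulled via the Twitter API
tweets = [
    "A great win for our district!",
    "Their corrupt, failing leadership has to go.",
]
scores = [sentiment_score(t) for t in tweets]  # one score per tweet
```

Scoring every tweet from each party’s candidates this way, then comparing the distributions, is the basic shape of a “who goes lower” analysis.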
Another of the journalists we spoke to shared: “...we’ve used the Twitter API to study popular sentiment in response to current events. Specifically, we’ve found that the most commonly used emojis in tweets about a given topic (e.g. the 2016 election, or the Taylor Swift - Kanye West dispute) often provide a visually appealing and intuitive roadmap to understanding broad trends in society.”
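Tallying the most common emojis in a batch of tweets can be sketched like this (the sample tweets are invented, and the codepoint ranges matched cover only the most common emoji blocks — a production version would use a complete emoji list):

```python
import re
from collections import Counter

# Match the main emoji codepoint blocks (partial coverage, for illustration)
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def top_emojis(tweets, n=3):
    """Return the n most common emojis across a list of tweet texts."""
    counts = Counter()
    for t in tweets:
        counts.update(EMOJI.findall(t))
    return counts.most_common(n)

# Invented sample standing in for tweets pulled from the API
tweets = ["Election night! 🔥🔥", "So nervous 😬🔥", "We did it 🎉"]
```

Run over tweets matching a topic query, the top of that list is the “intuitive roadmap” the quote describes.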
And if you don’t have coding skills? While J-school lecturer Walid Al-Saqaf stressed that these skills help you get the most out of APIs, there are tools available that let anyone extract data from APIs, regardless of skill level. Walid himself has worked on an open source tool called Mecodify, which relies heavily on Twitter's APIs to fetch details of tweets and Twitter users, and can be used by anyone interested in uncovering trends in social media data.
Don’t forget: Twitter data is also likely to be personal data. Before you get started with their APIs, be sure to read up on the ethics of publishing Twitter data in your stories.
2. To build journalistic tools
Like Mecodify, there are plenty of other journalistic tools that help reporters harness the power of APIs. We were lucky to speak with the team over at Datasketch, who use the Google Sheets API in their data apps.
“Data comes in different flavors, flat files, json, SQL. But the one data source that we have consistently seen in many newsrooms are Google Sheets. Journalists need to manually curate datasets and a collaborative tool like Google Sheets is appropriate. Unfortunately, it is not so easy for non-advanced users to use Google Sheets for data analysis and visualisations. This is why we use the Google Sheets API to allow journalists who do not have advanced data visualisation skills to connect a spreadsheet to our data visualisation tool -- an open source alternative to Tableau -- and generate many different charts and maps with a few clicks,” Juan Pablo Marin Diaz explained.
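The Sheets API’s values endpoint (`GET https://sheets.googleapis.com/v4/spreadsheets/{id}/values/{range}`) returns each row as a list of strings, with trailing empty cells omitted. Turning the first row into headers gives chart-ready records — a minimal sketch, with an invented response body mirroring that documented shape:

```python
def rows_to_records(values):
    """Convert the Sheets API 'values' payload (first row = headers)
    into a list of dicts, padding short rows with empty strings."""
    header, *rows = values
    return [
        {col: (row[i] if i < len(row) else "") for i, col in enumerate(header)}
        for row in rows
    ]

# Invented values.get response body (real calls also need an API key or OAuth)
response = {
    "range": "Sheet1!A1:C3",
    "values": [
        ["country", "year", "emissions"],
        ["Colombia", "2018", "98"],
        ["Chile", "2018"],  # trailing blank cell omitted by the API
    ],
}
records = rows_to_records(response["values"])
```

A tool like Datasketch’s can then map each header to a chart encoding without the journalist touching the raw JSON.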
3. To pull from media reports
Newsrooms aren’t just consumers of APIs; many offer their own for others to consume as well. Over at The Pudding, Ilia Blinderman and Jan Diehm found a fun way to use an API from The New York Times (NYT):
“We used the NYT Archive API for our A Brief History of the Past 100 Years project, which provided us with an unparalleled look at the issues that were most important to the journalists in decades past. While the archive doesn't allow developers to pull the full text of all articles, it does allow us to search the article contents and retrieve the metadata, as well as some of the text, from the NYT's 150+ year archive. It's a telling look at the issues of the day, and provides a helpful glimpse of the concerns that dominated the national conversation,” Ilia told us.
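The Archive API serves one month of article metadata per request, at `https://api.nytimes.com/svc/archive/v1/{year}/{month}.json`. A minimal sketch of building the request URL and pulling headlines out of the response (the sample response fragment is invented, but mirrors the documented `response.docs` shape):

```python
def archive_url(year, month, api_key):
    """Build the NYT Archive API request URL for one month of metadata."""
    return (
        f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"
        f"?api-key={api_key}"
    )

def headlines(archive_response):
    """Pull the main headlines out of an Archive API response body."""
    docs = archive_response.get("response", {}).get("docs", [])
    return [d["headline"]["main"] for d in docs if d.get("headline")]

# Invented fragment mirroring the documented response shape
sample = {"response": {"docs": [
    {"headline": {"main": "MEN WALK ON MOON"}, "pub_date": "1969-07-21"},
]}}
```

Looping year by year and counting keyword frequencies across those headlines is the core of a decades-long “issues of the day” analysis.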
4. To make maps
When it comes to making maps, APIs can help by pulling useful geographic data. In one of our favourite projects, Wonyoung So used APIs to map the distribution of citizen cartographers in North Korea.
“North Korea is one of, if not the most, closed countries in the world from diplomatic, touristic, and economic standpoints. Cartographers of North Korea aims to discuss how collaborative mapping strategies are used to map uncharted territories, using North Korea as a case study. OpenStreetMap (OSM) enables ‘armchair mappers’ to map opaque territories in which local governments control internet access to its residents. The project tackles the questions of who is mapping North Korea, which tools and methods the contributors use to have access and represent the country, and which are the motivations behind such mapping endeavor,” he said.
“This project heavily relies on the APIs provided by the OSM communities. The OSM data for North Korea was downloaded in October 2018 using Geofabrik’s OpenStreetMap Data Extracts, a service that breaks down OSM Planet data to country level and updates it daily. Contributors' activity can also be estimated by means of OSM changesets, which are a history of each user’s past contributions, and it can be retrieved by OSM API. Using these changesets, one can see which regions other than North Korea contributors have also worked on.”
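Changesets come back from the OSM API (`GET https://api.openstreetmap.org/api/0.6/changesets?display_name=...`) as XML, each with a bounding box showing where the user edited. A sketch of extracting those boxes, using an invented two-changeset response in the documented format:

```python
import xml.etree.ElementTree as ET

def changeset_bboxes(xml_text):
    """Extract (id, (min_lon, min_lat, max_lon, max_lat)) pairs from an
    OSM changesets response. Changesets without a bounding box (empty
    edits) are skipped."""
    root = ET.fromstring(xml_text)
    boxes = []
    for cs in root.findall("changeset"):
        if "min_lat" not in cs.attrib:
            continue
        bbox = tuple(float(cs.attrib[k])
                     for k in ("min_lon", "min_lat", "max_lon", "max_lat"))
        boxes.append((cs.attrib["id"], bbox))
    return boxes

# Invented sample response (real ones carry user, timestamps, tags, etc.)
sample = """<osm>
  <changeset id="1" min_lat="38.9" min_lon="125.6" max_lat="39.1" max_lon="125.9"/>
  <changeset id="2"/>
</osm>"""
```

Checking each contributor’s bounding boxes against country borders is how one can see which regions, beyond North Korea, a mapper has worked on.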
5. To determine the sex of a name
One journalist told us how she analysed the gender balance of Oscar nominees: “I would scrape the list of Oscar nominations and winners, and then use the Genderize.io API to help me determine the sex of each name. For each name, the API would return the most likely sex of the person associated with the provided name, together with a probability estimate (number of occurrences of the name in the database + % of such occurrences associated with the most recurring sex).”
But APIs aren’t foolproof, she warned: “While it sounds very clean and quick, a lot of manual work and fuzziness was still involved. For example, the API worked very well with English names and people, but was pretty clueless with foreign names...‘Andrea’ had a high chance of being a female name (in English), while it is mostly a male name in other languages (Italian). So the API provided a first classification, but then a lot of manual work was involved in the verification.”
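One way to build that verification step in is to accept the API’s guess only above a probability threshold and route everything else to manual review — a sketch, with invented responses mirroring the Genderize.io payload shape:

```python
def classify(genderize_response, threshold=0.9):
    """Accept the API's guess only when its probability clears the
    threshold; everything else goes to a manual-review pile."""
    gender = genderize_response.get("gender")
    prob = genderize_response.get("probability", 0)
    if gender and prob >= threshold:
        return gender
    return "needs manual check"

# Invented responses in the API's documented shape (name/gender/probability/count)
meryl = {"name": "meryl", "gender": "female", "probability": 0.98, "count": 1200}
andrea = {"name": "andrea", "gender": "female", "probability": 0.55, "count": 9000}
```

Ambiguous cross-language names like “Andrea” land in the manual pile instead of silently skewing the results.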
6. To follow the money
The Financial Times’ recent investigation, Extensive ties found between Sanjeev Gupta bank and business empire, raised questions about the independence of a bank owned by the metals magnate Sanjeev Gupta. David Blood talked us through how they used APIs to follow the money:
“The core of the story was our discovery that, of the bank’s security interests registered with Companies House in the form of ‘charges’, almost two-thirds were held against Gupta-linked companies.
A charge gives a creditor security over the assets of a debtor. Banks often use charges to secure collateral for loans or other credit facilities. In the UK, most types of security interest must be registered with Companies House, the corporate register. However, Companies House doesn’t provide functionality for searching for charges. In order to find all the charges held by the bank, we had to scrape the Companies House API for all charges held by all 4.4m active UK companies.
I wrote a series of Python scripts for scraping data and retrieving PDF documents from the API. I used pandas in a Jupyter notebook to load and filter the data and identify the charges held by the bank. The Companies House API is quite well documented, which is not always the case, unfortunately, but was certainly helpful in reporting this story.”
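The filtering step David describes can be sketched like this (he used pandas; the plain-Python version below and the charge records in it are invented, but mirror the `persons_entitled` field of the Companies House charges resource at `GET https://api.company-information.service.gov.uk/company/{number}/charges`):

```python
def charges_held_by(all_charges, creditor):
    """From (company_number, charge) pairs scraped from the Companies
    House charges endpoint, keep the companies whose charge lists the
    creditor among its persons_entitled."""
    held = []
    for company_number, charge in all_charges:
        names = [p.get("name", "") for p in charge.get("persons_entitled", [])]
        if any(creditor.lower() in n.lower() for n in names):
            held.append(company_number)
    return held

# Invented records mirroring the API's charge resource shape
scraped = [
    ("00000001", {"persons_entitled": [{"name": "Example Bank Plc"}]}),
    ("00000002", {"persons_entitled": [{"name": "Another Lender Ltd"}]}),
]
```

Cross-referencing the resulting company numbers against a list of Gupta-linked companies gives the two-thirds figure at the core of the story.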
Our next conversation
Last month, ProPublica launched their Guide to Collaborative Data Journalism, revealing the secrets behind successful partner-based projects like Electionland and Documenting Hate. Joining us in our next edition to answer your questions about the guide, and about building data journalism coalitions, will be Rachel Glickhouse. Comment with your questions!
As always, don’t forget to comment with what (or who!) you’d like us to feature in our future editions.
Until next time,
Madolyn from the EJC Data team