With 2021 coming to a close, there is no better time to take a look back at some of the most memorable data journalism projects published this year.
While COVID-19 has continued to consume the news agenda, data journalists have also covered other critical topics this year, ranging from climate change to big tech, health, sexual violence and Afghanistan.
With the world experiencing an increase in severe weather in 2021, many data journalists produced data stories focusing on climate justice. The COP26 UN Summit in Glasgow, Scotland also spurred journalists to cover environmental issues.
Holding big tech accountable was another theme that continued to surface this year. This led journalists from several outlets to examine the repercussions of algorithms on society, highlighting corporate negligence.
From COVID-19 to other outbreaks, health remained an important topic driving much data journalism coverage in 2021.
The DataJournalism.com team has picked our 12 favourite data journalism projects of 2021. Below, we examine what went into each story.
Here is our selection (in no particular order):
- Pandora Papers - ICIJ
- Afghanistan, visualising 20 years of war - Al Jazeera
- Hot and humid Olympic summer - Reuters Graphics
- Digital Violence: How the NSO group enables state terror - Forensic Architecture
- The underwater ‘hotspot’ feeding La Palma’s volcano will create new islands - El País
- Following the science - The Pudding
- COVID-19 vaccination tracker - Reuters
- America’s food safety system failed to stop a salmonella epidemic. It’s still making people sick - ProPublica
- Sexual violence in Singapore - Kontinentalist
- The climate crisis, by the numbers: Your guide to humanity’s greatest challenge - BuzzFeed News
- Visualised: Glaciers then and now - The Guardian
- What it takes to understand a variant - The New York Times
Billed as the "largest investigation in the history of journalism," the Pandora Papers, run by the International Consortium of Investigative Journalists (ICIJ), uncovered the offshore dealings of 35 current and former heads of state and more than 300 other current and former public officials and politicians around the world. This kind of international project is a prime example of collaborative data journalism at its best. In fact, the Pandora Papers involved over 600 journalists from 150 news outlets who collectively analysed nearly 12 million documents leaked by 14 offshore providers.
So what essential tools did ICIJ use to coordinate such an investigation? Pierre Romera, chief technology officer of ICIJ, spoke to us about the ins and outs of his work. He opened up about the tech platforms used to dive into the data and help journalists find stories for their investigations. One of those is Datashare, "a tool that distributes text extraction from documents across many servers. With Datashare, you can ask dozens of servers to work on the document to put it into a search engine and extract the text from the document."
Another key tool he mentioned is the iHub. This is "the ICIJ's digital newsroom, a platform that allows journalists to share insights, leads, testimonies, videos and anything they produce related to the investigation."
It's a process Romera called "radical sharing," meaning that all collaborators are encouraged to share everything they have found in their digging. In this way, they have also managed to circumvent censorship and publish pieces that would otherwise have been banned.
The U.S. withdrawal from Afghanistan and the subsequent takeover of the country by the Taliban was one of the biggest news stories of 2021.
Afghanistan, Visualising the impact of 20 years of war by Al Jazeera provides readers with a glimpse into the human side of the data. Along with other visual stories, videos, and character-driven features, it helped draw attention to this historic event.
Mohammed El-Haddad, the interactive editor for Al Jazeera English's online department, explains, "The most difficult part of the story was getting meaningful and complete data from all 34 provinces of Afghanistan to capture as accurately as possible what 20 years of war did to the Afghan people. The backbone of the story was eight infographics, each measuring an aspect of human suffering as a result of decades of war. These graphics range from quantifying the human cost of war, to summarising the history and geography of the conflict, to breaking down the economic cost."
El-Haddad ensured that the story's graphics were designed to be shared as social media cards across digital platforms. This meant that each graphic had to provide easily digestible information.
El-Haddad used several tools for the project, including Tabula to extract data from PDF documents; Google Sheets for data cleaning and analysis; and Adobe Illustrator to present the data.
The story page was packaged together with full-screen images and a written narrative, while the site was built using AMP -- a mobile-first web framework that provides faster load times and SEO benefits.
The site continues to generate a lot of long-tail search traffic worldwide, showing that audiences are still interested in understanding the origins and consequences of this seemingly endless war.
The 2020 Olympics were hosted earlier this year...in 2021. Tokyo hosted the games, the single most-watched sporting event on the planet, this past July.
Reuters Graphics' team brought the extreme weather conditions into focus with a great visual in this article.
Minami Funakoshi and Ally J. Levine illustrated that this year’s games could be one of the hottest summer Olympics ever, which meant athletes not only had to deal with COVID-19 but also scorching temperatures and high humidity. The fact that heatstroke and COVID-19 share some similar symptoms only made matters worse.
On what it took to create such a visually impactful article, Funakoshi told us, "It took us a while to figure out how to easily follow the steps and make the animations and transitions smooth. But it was definitely worth the effort."
Levine noted the importance of picking the most critical details for the final design: "The illustration explaining how sweat cools the body went through many iterations where we added and removed information to make the graphic as clear as possible."
To build the project, the team used D3.js, Canvas, Svelte and Adobe Illustrator.
Editor Simon Scarr explained that the biggest challenge for him involved making sure the narrative was as clear as possible in order to guide the reader through each step.
Digital Violence: How the NSO group enables state terror uncovers the impact of the Pegasus malware, the source of what has been referred to as the most sophisticated surveillance attack ever, developed by Israeli cyber surveillance firm NSO Group.
Created by Forensic Architecture in collaboration with Digital Labs and Amnesty International, the platform mapped the global landscape of cyber-infections for the first time, identifying the journalists, activists and human rights defenders targeted by Pegasus and telling the stories of those who were intimidated.
The project's purpose is to examine the relations between the full range of NSO activities around the world, according to Shourideh Molavi, the researcher in charge at Forensic Architecture.
The platform, which has been updated with Pegasus Project leaks since its launch in July 2021, contains around 2,000 data points with information on export licenses, alleged purchases, digital infection attempts, and events in the physical world related to the people reportedly targeted, such as intimidation, assault, defamation, and even murder. The team developed custom open-source software to present this data as an interactive, time-organised 3D platform that shows NSO's documented areas of operation.
In the "Pegasus Stories" video series, short films tell, for the first time, the stories of civil society actors targeted by Pegasus. They detail the experience of surveillance as personalised terror that takes a psychological toll within networks of collaboration and friendship, as well as the resistance and perseverance of those targeted in the face of it. In addition, the data can be listened to via a data sonification created by Brian Eno.
Molavi explained that many challenges existed throughout the investigation -- not least how advanced these cyberweapons are and how difficult they are for experts to track down. But the most challenging part was "the realisation of how vulnerable we and other human rights defenders are to this increased level of violence by state and corporate actors."
The platform will continue to be updated as the investigation continues.
This year's volcanic eruption in La Palma, Spain, captured the world's attention. For data journalists at El País, it led them to publish The underwater 'hotspot' feeding La Palma's volcano will create new islands.
The most-read article in El País' science section, the piece was written by Nuño Domínguez and illustrated by Mariano Zafra. Originally written in Spanish and later translated into English, the piece demonstrated the volcanic origin of the Canary Islands by showing the power of this destructive and natural phenomenon.
What's most captivating about this example is that it effectively exploited the intersection between science, data, and current events. The collaboration between the designer and journalist was also no minor task. The extensive amount of information came from local government sources, public records and the Copernicus satellite system. The team used Bing Maps to develop the project.
To develop a vaccine in record-breaking time is no task for a single person. In fact, it takes many institutions, even more scientists, and their individual findings, to deliver a vaccine.
To demonstrate this, data visualisation designer Jeff MacInnes published Following The Science for The Pudding.
“I was floored by the sheer number of COVID-related research articles that came out in 2020," said MacInnes. "A major goal of this project was to invite the public for a behind-the-scenes peek at just how many researchers were working on this problem.”
As he explained, science communication has the tendency to focus only on the results, which at times may cast a shadow over the process. Beyond neglecting the indebtedness of an article to all the publications that allowed this new knowledge to come to light, there are several other unintended consequences.
“It promotes the misleading notion that science is definitive and absolute, which in turn can lead to public distrust whenever knowledge improves and new findings seem to contradict older ones. Relatedly, it also misses the opportunity to get the public more comfortable with uncertainty,” said MacInnes.
The interactive piece showed the number of papers related to COVID-19 that were released in PubMed, the biggest library of scientific journals. It also revealed the connections between them by examining the inter-institutional ties on a map, and by linking the papers together in an interactive network graph.
The most challenging and time-consuming part was geocoding the collaborations among researchers. He explained all he had to work with was the author affiliations as reported by each article.
The data looked particularly messy here: different ways of typing the same institution, different ways to type in the information. When you are handling 93,593 journal articles, generating 2,955,500 total collaborations, the lack of structure in the data is understandable.
To build the interactive piece, MacInnes used React with a mix of three.js, d3.js and p5.js. As for scraping, wrangling and processing the results, he used Python libraries and Jupyter notebooks.
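To give a feel for the kind of cleanup this involves, here is a minimal sketch in Python. It is not MacInnes's code: the `normalise` and `count_collaborations` helpers, the cleaning rules and the sample affiliations are all hypothetical, and counting one collaboration per pair of distinct institutions on a paper is our assumption about how such a tally could work.

```python
import re
from collections import Counter
from itertools import combinations


def normalise(name):
    """Collapse common spelling variants of an institution name (illustrative rules only)."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())    # drop punctuation
    cleaned = re.sub(r"\bthe\b", " ", cleaned)         # drop the article "the"
    cleaned = re.sub(r"\buniv\b", "university", cleaned)  # expand one abbreviation
    return " ".join(cleaned.split())                   # squeeze whitespace


def count_collaborations(papers):
    """Count each pair of distinct institutions that appear on the same paper."""
    pairs = Counter()
    for affiliations in papers:
        institutions = sorted({normalise(a) for a in affiliations})
        pairs.update(combinations(institutions, 2))
    return pairs


# Hypothetical affiliations, typed inconsistently as in real author metadata.
papers = [
    ["The Univ. of Oxford", "Harvard University"],
    ["University of Oxford", "Harvard University", "University of Oxford"],
]
print(count_collaborations(papers))
```

Real affiliation strings are far messier than this, so hand-written rules like these would only be a first pass before geocoding.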
This year, many data journalists faced the challenge of creating live trackers that pulled data from different sources to generate real-time graphics with text, telling us the story of the pandemic. One example that stood out was Reuters’ COVID-19 vaccination tracker.
The team behind the tracker created a sleek-looking webpage filled with comprehensive data on vaccine uptake around the world, including the possibility to compare countries by geographic region and income level.
The tracker is managed through a combined effort of a large number of reporters at Reuters, sharing tasks ranging from bringing together frontend and backend development to gathering and analysing information on countries’ vaccination policies and vaccine uptake.
We spoke to Prasanta Kumar Dutta, from Reuters’ design and development team, about what differentiated this tracker from the rest: “Reuters also took on the hard task of tracking the vaccination policies that determined who had access to COVID-19 vaccines in countries around the world. That meant creating a unifying structure to catalogue the different phases of each country's vaccination rollout plan.”
Standardising around 200 countries’ policies came with the added challenge that any individual country's plans could change at any time during the rollout phase.
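As a rough illustration of what a unifying structure for rollout phases could look like, here is a minimal Python sketch. It is not Reuters' schema: the `RolloutPhase` record, the `eligible_on` helper, the group labels and the dates are all hypothetical, and it assumes each phase lists the full set of eligible groups from its start date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RolloutPhase:
    """One phase of a country's rollout plan (hypothetical schema)."""
    country: str
    start: date
    eligible_groups: frozenset


def eligible_on(phases, country, day):
    """Return the groups eligible in `country` on `day`, based on the latest phase started."""
    started = [p for p in phases if p.country == country and p.start <= day]
    if not started:
        return frozenset()
    return max(started, key=lambda p: p.start).eligible_groups


# Invented example data, for illustration only.
phases = [
    RolloutPhase("Examplestan", date(2021, 1, 10), frozenset({"health workers"})),
    RolloutPhase("Examplestan", date(2021, 3, 1), frozenset({"health workers", "over 65"})),
]
print(sorted(eligible_on(phases, "Examplestan", date(2021, 2, 1))))
```

With records like these, a change of plan is handled by adding or replacing a phase rather than by restructuring the whole catalogue.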
The Reuters COVID-19 vaccination tracker brings together information on the vaccination rate for each country, vaccine eligibility, and the impact of vaccination policy campaigns on pandemic curves. “All these combined, tells an important part of the pandemic story as countries began to mitigate the spread of the virus with vaccines,” said Dutta.
To answer questions on vaccine uptake and eligibility, the tracker’s data also powers the COVID-19 vaccine experience on Amazon Alexa in a number of countries.
The tracker was designed first on pen and paper, then in Adobe XD, and finally coded in D3.js. To automate and schedule tasks, Reuters uses GitHub Actions, while the tracker was built on React.js and Next.js.
It's not often a reporter can analyse genomic sequencing data to investigate the persistence of a salmonella bacterial strain responsible for making thousands of Americans sick. But at ProPublica, the skills of reporter and PhD graduate Irena Hwang meant just that.
When the U.S. Department of Agriculture closed an investigation into a salmonella outbreak in the country in 2019, people may have assumed it was resolved. However, the ProPublica piece showed that authorities stopped investigating the outbreak despite available evidence revealing that it was far from over.
The team did this by examining publicly available data on the genetic properties of the salmonella cases in the country. They were able to assess the genetic similarities of the bacterial strain outbreak to more recent cases across the country.
Looking for the source of the outbreak, reporters at ProPublica brought together this publicly available data with information obtained through FOI requests about the salmonella-infected poultry plants and the epidemiological information of the people who got sick. While the source was difficult to find, merging this data allowed them to create a tool that helps consumers look up the location of the cases.
Hwang told us: "There was this moment when I almost downloaded terabytes of data and set up a cloud-based, custom bioinformatic analysis pipeline. But then I realised that it just would not be the best use of my time. I decided that it was better to leave the hard-core sciencing to the scientists. [...] As a journalist, my job is to reveal how an event unfolded and to find hidden stories in data."
To analyse the data, Hwang used Python (pandas) in JupyterLab and some command-line tools to extract data from public APIs. To wrangle the data, she first converted giant TSV files to SQL databases using the DB Browser for SQLite and then queried the SQL database through the sqlite3 Python package.
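For readers curious about the TSV-to-SQLite pattern, here is a minimal sketch of the idea. It is not Hwang's pipeline: it does the conversion with Python's standard `csv` module instead of DB Browser for SQLite, and the `load_tsv` and `cases_per_state` helpers, the table, the column names and the sample rows are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for a large TSV export; columns and values are invented.
SAMPLE_TSV = (
    "isolate_id\tstate\tcollection_year\n"
    "A1\tTX\t2018\n"
    "A2\tTX\t2020\n"
    "A3\tCA\t2020\n"
)


def load_tsv(conn, tsv_text, table="isolates"):
    """Create a table from the TSV header and insert every row as text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


def cases_per_state(conn, since_year):
    """Count rows per state from a given collection year onwards."""
    rows = conn.execute(
        "SELECT state, COUNT(*) FROM isolates "
        "WHERE CAST(collection_year AS INTEGER) >= ? "
        "GROUP BY state ORDER BY state",
        (since_year,),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")  # a file path would be used for real data
load_tsv(conn, SAMPLE_TSV)
print(cases_per_state(conn, 2020))
```

Once the data sits in SQLite, filters and aggregations run as SQL queries instead of requiring the whole file to be loaded into memory.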
This 8-month investigation revealed that there are serious weaknesses in the United States when it comes to food safety. "Perhaps stories like ours can help propel the food system into a new era, where the latest in scientific knowledge and advancements are leveraged for improved food safety," said Hwang.
This year Kontinentalist and Women Unbounded teamed up to investigate sexual violence in Singapore. This was the country’s first-ever data-driven story about the sexual violence epidemic.
The piece puts a spotlight on the key patterns around sexual violence, including the average age of the victim, the nature of the relationship between victim and perpetrator, and the vulnerability of victims. While in recent years the law in Singapore has stepped up to tackle the issue, internalised shame and misplaced blame mean many cases go unreported.
“Singapore’s society and mainstream media have often perpetuated rape myths and unhelpful attitudes about sexuality and gender that fail to constructively discuss and resolve the problem of sexual assault and violence,” said Mick Yang from The Kontinentalist. “The advent of the #MeToo movement brought survivors’ voices to the fore, but sceptics often saw a trickle of one-off cases—mistakenly thinking the blame lay with a ‘problematic woman’—rather than understanding that this was a systemic problem.”
According to Yang, the most challenging aspects of the piece were gathering and analysing the data. With a lack of official data and FOI mechanisms in Singapore, the team relied on manually digging for articles and examining each one individually. Another challenge involved deciding on the relevant parameters for what data to keep and what to bin.
The scrollytelling experience was generated using Flourish and the scrollama.js library, whereas Procreate, Figma, and Adobe Illustrator were used for the illustrations.
From severe flooding in Germany to uncontrollable wildfires in Greece and unprecedented snowfall in Spain, there was no shortage of extreme weather in 2021. What's more, scientists attributed these events to climate change.
UN Summit COP26 also provided an opportunity for world leaders to come together to tackle what naturalist David Attenborough called "the biggest threat to security that modern humans have ever faced".
With so much focus on climate justice, it is no surprise environmental issues received as much news coverage as they did.
When reporting focuses on single events, it can be challenging to put the pieces together into a clear and cohesive narrative that gives the general public the big picture. This was what BuzzFeed News' science newsdesk had in mind, noted reporter Peter Aldhous.
During COP26, Aldhous and fellow BuzzFeed News colleague Zahra Hirji published a top-to-bottom summary of the key figures, with reliable data and a brief narrative about climate change.
To shift public perception on this issue, Aldhous explained it is vital to ensure such stories focus on the human cost of the crisis and put pressure on those in power.
According to Aldhous, the team used scripts written in R to handle the data. Most charts were built with Datawrapper using BuzzFeed News' customised design.
One powerful way to tell the story of climate change is to showcase changes over time. In Visualised: Glaciers Then and Now, The Guardian’s Niko Kommenda walks us through a visual exploration of the shrinkage of 90 of the largest and most surveyed glaciers across the globe.
By selecting glaciers from as many regions as possible, the article showed how much glaciers have changed in size over time. As noted in the article, the consequences are substantial, as nearly two billion people in the world depend on glaciers as their primary water supply.
Using data from the Global Land Ice Measurements from Space (GLIMS) database, the piece shows the extent of the melting over the span of time for which data exists for each glacier. However, as Kommenda wrote in the article, the database comes with some challenges.
“In some cases, an apparent change in a glacier’s extent can be caused by different teams of researchers measuring it differently. But the vast majority of glaciers are losing more ice than they accumulate because global temperatures are much higher today than they were in pre-industrial times,” said Kommenda.
To understand the limitations of the data, he reached out to the scientific community who built the database.
When it comes to COVID-19 coverage, audiences now more than ever are seeking clear and relevant information to help them stay safe. Given the amount of uncertainty and constantly changing landscape of case rates, deaths, vaccines, variants, and government restrictions, public service journalism has never been more important.
In this uncertain environment, sometimes the best approach is to focus on one aspect of the pandemic and explain it in a guide-like manner with the latest verified information. This is what Amy Schoenfeld Walker and Lazaro Gamio at The New York Times delivered in What It Takes to Understand a Variant. This explainer piece walks us through the necessary steps to comprehend a variant, whether that be Omicron or a future one.
The pair explain four key aspects of assessing a variant: 1) sequencing and tracking cases, 2) pinpointing the variant’s transmissibility, 3) investigating existing immunity, and 4) determining the variant’s severity. For each section, they clarify where we currently stand and what is next, using simple charts and colours to guide the reader.
The article quantifies the efforts and steps required by the scientific community to navigate the COVID-19 pandemic.
This year data journalism has continued to thrive, despite challenges imposed by the ongoing pandemic.
Reporters across the globe have come together to hold those in power accountable. This has generated the biggest investigation in journalism history and shows us what is possible through large-scale collaboration.
Many of 2021’s chosen stories involved wrangling an abundance of data, from which journalists had to carefully draw out the narrative, a task that may be mistaken for straightforward.
Visually, data journalism continues to pair with crafted, interactive forms of storytelling, generating memorable and informative experiences. This can be the product of a single reporter managing all aspects of story development or involve a joint effort across newsrooms.
With this roundup, we wrap up the year. Needless to say, we are really excited to see what data journalists will bring to the news ecosystem in 2022! Is there a project we missed? Which one caught your eye? Join us on Discord and let us know!
In Following The Science, we stated that the “scientific community” often focuses on the results, taking away the attention from the process. This is incorrect. In fact, what Jeff MacInnes said is that “science communication” has a tendency to focus on the results, not the “scientific community” per se. The article has since been corrected to reflect this.