Climate change through a solutions and data lens
https://datajournalism.com/read/longreads/climate-change-through-a-solutions-and-data-lens
By Sherry Ricchiardi · Mon, 04 Dec 2023

Reporter Zuha Siddiqui stood atop a pile of rubble overlooking a waterway choked with plastic and toxic waste, signalling the “climate apocalypse” facing Karachi, Pakistan’s largest city. A stream, the colour of soot, flowed beneath the debris.

“If you stand here and trip by accident, you will fall to your death,” warned a local guide, recounting how a 65-year-old man died after slipping into the festering pit near his home. The sprawling city, situated along the Arabian Sea coast, has earned a dubious distinction.

Karachi, population 17.5 million, is listed among the world’s least livable cities, ranking 168th out of 172 on The Economist’s 2022 Global Livability Index. It was a perfect setting for The Sinking Cities Project, a global cross-border investigation examining how sea-level rise impacts major cities and how governments respond to the climate crisis. Solutions were part of the mix.

The Sinking Cities Project, a global cross-border investigation that examines how sea-level rise impacts major cities and how their governments respond to the consequences of the climate crisis.

The project, published by Unbias the News, a self-described “feminist cross-border newsroom,” brought together six local journalists from cities threatened by rising seas in Europe, Asia and Africa. Nearly 1.8 billion people in the world live in areas with increasing exposure to flooding and storm surges.

Siddiqui’s story documented how local officials failed to mitigate the harmful effects of climate change, including by neglecting upkeep of the city’s stormwater drains, a safeguard during heavy rains. She also addressed the question: Who is looking for solutions?

She talked with activists who organised a “People’s Climate March” to highlight the crisis and how it is being mishandled.

“Having a working-class, grassroots movement leading the Climate March and protesting [to the government] for equitable solutions helps build awareness, both at the civil society and government level. That has made the biggest difference,” said Siddiqui. She also told of climate advocates taking the city to court to stop a construction project harmful to the environment.

Scenes from the Climate March in Karachi, as covered in The Sinking Cities Project. Photo Credit: Zuha Siddiqui.

The Karachi story was part of The Sinking Cities Project, which won the 2023 Climate Journalism Award, hosted by the European Journalism Centre and sponsored by the Google News Initiative.

Covering climate change through a solutions lens is picking up steam. It is not about telling positive, pat-on-the-back stories about the climate. The solutions journalism model pairs investigative techniques with data and scrupulous reporting to identify policies and practices that help mitigate the crisis.

“Solutions reporting seeks out responses that are working and puts a spotlight on the places that are verifiably getting it right,” said Matthew Kauffman, data manager for Solutions Journalism Network (SJN). He noted in an SJN blog, “When news outlets ask and answer the question, ‘Who’s doing it better?’ they help their audience see and explore possible opportunities for change.”

How is that concept playing out in today’s newsrooms?

When news outlets ask and answer the question: “Who’s doing it better?” they help their audience see and explore possible opportunities for change.

A watershed moment

News outlets have warned about climate change for decades, predicting impending doom. An overload of negativity can “freak people out,” as one environmentalist put it, and switch them off. Headlines about the planet heating up have become so routine they hardly constitute news anymore.

The UN describes the climate crisis as “the biggest threat modern humans have ever faced.” Records show that the last decade was the hottest in human history; wildfires, floods and droughts have become the new normal. That poses a quandary for journalists.

“We must close the significant gaps between what audiences may need from climate change news coverage if they were to get more engaged in the story and what news providers cover. Could journalists inadvertently be contributing to climate inaction?” asked a World Association of News Publishers report.

For the past 20 years, noted scholar Maxwell Boykoff has researched how media cover climate change. He has seen evidence of audiences being turned off.

“I have written about how doom and gloom reporting has been found by researchers to raise awareness, but it may effectively paralyse people from taking action. It can be overwhelming,” said Boykoff, who leads the Media and Climate Change Observatory (MeCCO) at the University of Colorado.

There still is not anywhere near enough coverage about the many aspects of a changing climate as it relates to human and non-human communities and individuals.

The observatory monitors 131 sources across newspapers, radio and TV in 59 countries in seven different regions of the world to measure trends in climate coverage. Data is assembled via the Nexis Uni, ProQuest and Factiva databases.

“There still is not anywhere near enough coverage about the many aspects of a changing climate (causes, consequences and solutions) as it relates to human and non-human communities and individuals,” said Boykoff.

Researchers analysing cumulative mentions of climate terms in Google News article headlines found 336,000 stories containing the term “climate change”. Of those, only 0.2% contained the phrase “climate change solutions”, and 0.5% included both “climate change” and “solutions”, according to a WAN-IFRA report published in February 2022.
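
As a rough illustration of how such a headline analysis can be run, here is a minimal Python sketch that counts phrase matches in a collection of headlines. The file name and column are hypothetical placeholders, not the researchers’ actual pipeline.

```python
import pandas as pd

# Hypothetical input: one headline per row (illustrative, not the study's dataset).
df = pd.read_csv("headlines.csv")  # assumed column: "headline"
headlines = df["headline"].str.lower()

base = headlines.str.contains("climate change", regex=False)
exact_phrase = headlines.str.contains("climate change solutions", regex=False)
both_terms = base & headlines.str.contains("solutions", regex=False)

n = base.sum()
print(f'Headlines mentioning "climate change": {n}')
print(f'  with the phrase "climate change solutions": {exact_phrase.sum() / n:.1%}')
print(f'  with both "climate change" and "solutions": {both_terms.sum() / n:.1%}')
```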

Those numbers could improve as solutions journalists carve out new territory and put a spotlight on what can be, and is being, done to help the planet survive. It is public service journalism at its best.

Solutions and data team up

Solutions Journalism Network's Matthew Kauffman sees new patterns of climate coverage emerging. He described three approaches to how data interacts with solutions journalism, pointing to climate stories that break the mould of doomsday reporting.

Kauffman, an award-winning investigative journalist, advises reporters to “rely on rigorous evidence to identify policies and practices with a proven track record [on climate]. There are cases where the data essentially are the solution, cases where various data points serve as resources that communities can use to make smart decisions about climate solutions.”

Asking the right questions is an important part of Kauffman’s equation. The Earth Journalism Network (EJN) suggests starting with a local example of action on climate change and tying that to a broader trend or issue.

Among the key questions, EJN suggests:

  • Where did this idea come from?
  • What evidence is there to show the solutions are working?
  • What do researchers say?
  • What do the numbers show?
  • Who are the critics, and what do they say?
  • What metrics matter when it comes to measuring success?
  • Is what’s happening in one place with solutions a model for somewhere else?

That line of inquiry is reflected in "The country trailblazing the fight against disasters", a BBC Future Planet report by freelance journalist Catherine Davison. The piece explores how Bangladesh, with few resources, has built climate resilience in the face of severe storms and natural disasters driven by climate change.

It won the 2023 Climate Journalism Award in the Storytelling and Solutions category and was described as “an exemplary piece of solutions reporting.” The story combined empirical evidence and data with input from those on the frontlines of climate change in Bangladesh, one of the world’s most disaster-prone countries.

An excerpt from the piece illustrates the solution angle: “Bangladesh’s system has become renowned for increasing the country’s resilience with relatively few resources, with its success lauded by experts as a model for other low-income countries looking to develop early warning systems in the face of a changing climate.”

The story described a multi-layered early warning system of weather monitoring equipment, communications operations and a large network of volunteers, half of them women, who get the word out.

Volunteers in Bangladesh's Cyclone Preparedness Programme take part in an early warning drill in Chila village, April 2022 (Photo Credit: Catherine Davison)


Stories like these can make a difference. A University of Maryland study said that news consumers exposed to coverage of solutions felt they could better influence climate change policy and support actions to address it. Focusing only on the negative aspects of climate change gives a false impression that there is nothing that can be done about it, the report found.

Focusing only on the negative aspects of climate change gives a false impression that there is nothing that can be done about it.

Evolutionary path

Climate journalism got off to a shaky start when the term “global warming” first surfaced in print on August 8, 1975 in the journal Science.

From the beginning, media outlets were thrust into a controversy between those who considered climate change a major threat to the planet and climate deniers who viewed it as a hoax, fake news or political conspiracy. This posed a challenge to the tenet of fairness and balance in newsgathering. As part of his research, Professor Boykoff zeroed in on the term “balance as bias” – defined as false balance in the news, also known as “bothsidesism” – and how it affected the media’s climate coverage.

A study he co-authored in 2004 found that “U.S. media outlets consistently reported both climate denial and climate science in a balanced manner, leading to biased overall coverage of climate change by implying that both views had equal evidence in favour of them.” Bothsidesism, in effect, diluted the perils of climate change.

“Reports often failed to show that climate change is not a single issue. It’s an intersectional set of challenges that flow through every aspect of the way we work, play and relax in society,” noted the professor.

Over time, the mass of scientific evidence moulded the media’s climate agenda. Newsrooms developed climate beats staffed by journalists specialising in the environment. They devoted investigative projects, special sections and interactive visuals to climate solutions. Many created climate teams.

Two groups that need more media attention: climate sceptics and communities hard-hit by the crisis.

In September, The New York Times brought together world leaders, activists, scientists and policymakers to examine the actions needed to confront climate change. Among them, Bill Gates, Al Gore, and Michael R. Bloomberg. There still are gaps to be filled.

Data journalist Eva Constantaras conducts workshops on reporting on climate solutions around the world. She identified two groups that need more media attention: climate sceptics and communities hard-hit by the crisis. She says deniers need information to help them break out of the cycle of polarisation. Vulnerable populations need it to protect themselves from increasingly dangerous conditions.

"Now that data-driven climate reporting has matured, it is time for the media to take on the tough task of extending coverage to hard-to-reach audiences,” said Constantaras, data editor for Lighthouse Reports and Internews. She advises, “Lay out your editorial strategy and impact goals in the short, medium and long term and make sure your [climate] coverage is helping advance toward those goals.”

Climate solutions journalism is a work in progress. At stake is a habitable planet with journalists on the frontlines of the fight to save it. The axiom “Everyone is a climate reporter now” rings true as climate coverage seeps into every newsroom beat. It’s the slant of the story that makes the difference.

Resources that can help

Covering Climate Now: Offers a Climate Solutions Reporting Guide that lists categories of stories to cover and questions to jump-start interviews on a variety of topics from politics and government to economics, technology and culture. Excellent tool for self-driven learning or newsroom staff development. Checklist of story ideas and sources at the end of the guide.

Solutions Journalism Network: Provides a framework for solutions reporting, with examples, tips, guidelines and a tracker for solutions stories. Guides reporters through a list of questions: “What are the solutions being put forward – and how has that solution worked, or not worked? What evidence, such as on-the-ground experience or empirical data, indicates that the solution is effective?” Excellent tool for developing new skills or polishing old ones on solutions reporting.

Oxford Climate Journalism Network: Supports a global community of reporters and editors across beats and platforms to improve the quality, understanding and impact of climate coverage worldwide. A programme of the Reuters Institute for the Study of Journalism. Includes a Global South Climate Database of scientists and experts from Asia, Africa, Latin America and the Pacific.

Project Drawdown: Focuses on science-based solutions and strategies, shifting from “doom and gloom” to “possibility and opportunity” and giving voice to “underrepresented climate heroes” through storytelling. A course, Climate Solutions 101, jump-starts reporters interested in climate change.

Earth Journalism Network: Article on “Reporting on climate change through a solutions lens” provides a list of dos and don’ts for climate reporters. A sampling: Use data! Ground lived experiences in research and vice versa; don’t promote silver bullets or one-size-fits-all solutions; and avoid false balance. There is a section on questions for solutions interviews and writing tips.

New climate metrics for new climate conversations
https://datajournalism.com/read/longreads/new-climate-metrics-for-new-climate-conversations
By Duncan Geere · Mon, 02 Oct 2023

Cast your mind back to the height of the Covid-19 pandemic, when the world seemed to have a single shared obsession - the “R number”. This simple metric, which measured the reproductive rate of the virus, gave a tangible sense of the invisible threat that lurked around us. Was it above one, meaning that the virus was spreading exponentially? Or had lockdowns and other public health measures successfully brought it below that threshold? Like a barometer, the R number became a predictor of the public mood - forecasting impending storms, or clear skies.

Covid is very much still out there, but swift measures on social distancing, travel restrictions, and vaccine distribution mean that it’s no longer seen by most as a major threat to humankind. These policies demonstrated what fast action, strong leadership and global cooperation can achieve, but their success was judged almost entirely by that R number.

This brings us to a crucial question. Do we need new metrics for communicating climate change?

Now that the urgency of the pandemic has receded for much of the world, another major threat looms larger than it ever has before - and it’s one that policymakers have so far failed to get a grip on. Climate change is already dramatically raising the likelihood of extreme weather events, causing shortages of food, water and other crucial necessities, and displacing people from their homes with increasing regularity.

In light of the swift action that we saw on Covid, why don’t we see a similar sense of urgency on what is arguably a much larger problem? One that threatens not just our health, but the habitability of our planet?

The answer is complex, and multifaceted. But surely part of it is that we lack a compelling, easy-to-understand “R number” for climate change - a single, digestible indicator that can galvanise public attention and be used to judge the success (or lack thereof) of political action. The ways in which humans are changing our planet are measured in various ways, but none of these metrics resonate with the immediacy and clarity that the R number brought to the Covid-19 pandemic.

This brings us to a crucial question. Do we need new metrics for communicating climate change? Metrics that clarify the stakes, and guide effective policy? Metrics that bridge the gap between the abstract and invisible threat of climate change, and the concrete steps needed for action? Metrics that could transform public understanding, and finally instil the urgency that the escalating climate crisis desperately demands.

Kris de Meyer, director of the Climate Action Unit at UCL, thinks we do. His team has spent the last year developing a prototype of a new climate dashboard which is inspired by the R number. The dashboard, which brings together the causes of climate change, the rate at which it's happening, and the impacts that it's having on our weather systems, is designed to offer journalists and other science communicators a set of tools to tell stories about the changes that humans are making to the planet, and how they’re linked to the experiences people have in their day-to-day lives.

Photo by UCL Climate Action Unit is licensed under CC BY 2.0.

It does this in just three numbers. “The first metric is the Earth's energy imbalance and that is the primary driver of climate change,” de Meyer explains. “It's the difference between the amount of energy coming into the Earth system from the Sun, and the amount of energy that the Earth radiates back into space. We're losing some heat into space, but there's a constant stream of energy arriving from the Sun.”

If climate change wasn’t happening, that number would be hovering around zero, meaning that the amount of energy coming in and the amount going out is more or less equal. Unfortunately, human activity has pushed it out of balance. “We're adding more energy to the Earth's system than we're losing back into space, which means that the Earth is storing that energy and that energy is leading to all sorts of downstream changes in weather,” explains de Meyer.

The second metric is the speed at which the Earth is warming, expressed as the number of degrees per decade. “The reason why we picked the speed of warming is because it tells us something about what is happening to global temperatures, or regional temperatures, right now,” says de Meyer. He argues that the traditional climate metric of how much warmer the Earth is than during the pre-industrial period hides the real pace of change. “Most of the 1.2 degrees that we've had has been happening in the last 40 years alone,” he says.
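
As a rough sketch of how a speed-of-warming figure can be derived, the Python below fits a linear trend to annual global temperature anomalies and converts the slope into degrees per decade. The input file and column names are assumptions for illustration, not the dashboard's actual data pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical input: annual global mean temperature anomalies in degrees C.
df = pd.read_csv("anomalies.csv")  # assumed columns: "year", "anomaly_c"

# Use a recent 30-year window so the trend reflects the current rate of change,
# not the long-term average since the pre-industrial period.
recent = df[df["year"] >= df["year"].max() - 29]

# Least-squares slope in degrees C per year; multiply by 10 for degrees per decade.
slope_per_year = np.polyfit(recent["year"], recent["anomaly_c"], 1)[0]
print(f"Speed of warming: {slope_per_year * 10:.2f} °C per decade")
```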

Finally, the third metric featured in the dashboard ties directly into people’s experiences - it’s an index of the ‘unusualness’ of the weather we’re experiencing. “Every event where a temperature record is broken is counted and then compared to the number of record-breaking temperatures that we would have if climate change wasn't happening,” de Meyer explains.

“Where the speed of warming tells us something about the average, this tells us something about the extremes. Climate change is changing ocean currents and wind currents. It is changing precipitation. It is making heatwaves longer and stronger, as we’re seeing at this very moment in much of Europe and the US. It can also generate more storms and change seasonal wind patterns, like monsoons. So by absorbing all of that energy, trapping all of that energy in the Earth's system, we are supercharging the weather.”
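
A toy version of such an ‘unusualness’ index can be built on a classic result from record statistics: in a stationary climate with no trend, the chance that year n of a series sets a new record is 1/n, so the expected number of records over n years is 1 + 1/2 + ... + 1/n. The sketch below compares observed record counts against that baseline; it illustrates the general idea only and is not the dashboard's actual method.

```python
import numpy as np

def unusualness(series: np.ndarray) -> float:
    """Ratio of observed high-temperature records to the number expected
    in a stationary climate (the harmonic-number baseline)."""
    n = len(series)
    # A year is a record if it exceeds every earlier year; year one always counts.
    running_max = np.maximum.accumulate(series)
    observed = 1 + int(np.sum(series[1:] > running_max[:-1]))
    expected = np.sum(1.0 / np.arange(1, n + 1))  # 1 + 1/2 + ... + 1/n
    return observed / expected

# Synthetic example: a warming trend plus noise pushes the ratio well above 1.
rng = np.random.default_rng(0)
temps = 0.02 * np.arange(100) + rng.normal(0, 0.15, 100)
print(f"Unusualness ratio: {unusualness(temps):.2f}")
```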

Together, de Meyer believes, these three metrics tell a compelling story that links the physical causes of climate change to people’s experiences of how the weather is changing.

You'll notice several things missing from these numbers that are commonly seen in communication around climate change. There’s no sea level rise. There are no tonnes of carbon, or parts per million of greenhouse gases. There’s no “net zero” or “1.5C”. There aren’t any friendly “human” comparisons - like bathtubs full of gasoline, or barrels of oil being stacked to the moon.

Photo by UCL Climate Action Unit is licensed under CC BY 2.0.

This is intentional, says Lucy Hubble-Rose, deputy director of the UCL Climate Action Unit. “The way that these climate metrics have been used in the past is really as headline statements. A new threshold will be crossed or we'll reach a threshold, say, in parts per million, and that will be reported as a headline within the media. But what we weren't seeing, and what we felt was a gap, was that there weren't metrics which can act as tools.”

Take the heatwaves of summer 2023 as an example, says Hubble-Rose. “It was notable to me that in a number of places where they were being reported, the words climate change were never mentioned. That's partly because journalists don't have a metric that they can reach for in the storytelling of what's happening - it's difficult to say: ‘This is climate change.’ You could reach for something like [our dashboard’s] unusualness metric in that situation, or the speed of warming metric, as a bit of context for what is happening in relation to climate change.”

Because of that, we know that it's important that reporting of climate change needs to start to come into lots and lots of other stories where it’s a factor, even if perhaps not the only factor.

Similarly, the team did some tracking on media coverage of notable climate news stories. The release of a report from the Intergovernmental Panel on Climate Change (IPCC), for example. “After three days it tails off,” says de Meyer. “The first day it was everywhere and then by a week there was no reporting on it at all,” adds Hubble-Rose. “Because of that, we know that it's important that reporting of climate change needs to start to come into lots and lots of other stories where it’s a factor, even if perhaps not the only factor.”

Hubble-Rose also notes that our existing metrics are primarily designed for a conversation about whether climate change is happening or not, and that’s an argument that’s now largely over. “There is a lot of agreement now that climate change is happening”, she says. “The main thing that people are trying to do now is to either effect change or to explain how fast it's happening or to explain how it's happening in the context of what we're seeing in the world. So those metrics which were historically extremely useful to be able to say, ‘Look, climate change is real,’ are less useful in telling those stories of what is happening right now.”

The first seed of the project was planted in July 2021 when an email landed in de Meyer's inbox, inviting him to a workshop with a collection of communicators, journalists, and scientists. The focus of the event was on what climate communicators could learn from how the R number was being used in reporting on COVID-19.

“I saw a lot of big names in the invitation list”, de Meyer says, “and with my hat on as a workshop facilitator I was thinking: this would really benefit from good facilitation to get as much information as we can out of those big names.” So he reached out to the organisers - the WWF and the Quadrature Climate Foundation - and asked if they needed help with the design and facilitation. “They said, ‘yes’, and so we started to work with the teams that were putting this workshop together.”

The result was an hour-long discussion where the experts in attendance discussed what made the R number popular, what made it a good representation of risk, what made it useful to communicators, and how those learnings could be applied to climate. The team collected together the conversations and analysed them, then went back to WWF and QCF with some suggestions for how the organisers could take the project forward. “They said, ‘Would you want to do this? Because we don't have the capacity to do this,’” de Meyer recalls. So his team wrote up a project proposal, secured funding from QCF, and got to work.

The first step was to gather a list of candidate metrics, and start narrowing them down - the team ran a series of scoping meetings with scientists, data experts and journalists, testing different ideas and their pros and cons. There wasn’t always agreement. “Rather than trying to accommodate the opinion of every scientist in the room, we set it up in terms of trade-offs,” says de Meyer. “Then we analysed that information and read it back to them and said, ‘Based on everything you said, we want to go for this trade-off. Is that acceptable to you?’”

Photo by UCL Climate Action Unit is licensed under CC BY 2.0.

Early on, they found that data wasn't really the problem. There was a lot of data. The problem was about making that data more accessible to communicators and the public. They hunted for metrics that were closely linked to the system, that were easy to understand and remember, that had a natural threshold between “good and bad” (like the above/below 1 threshold of the R number), and that showed the immediate situation, not an average or projection over many decades. They also looked at scientific robustness, availability of data, usefulness for comms, and more - eventually settling on four metrics - the three listed above, as well as a fourth “people-related impact metric”.

This last metric proved particularly thorny to nail down. “It was meant to be one that looked at what is happening to people,” says de Meyer. “But very quickly, when we started to talk to some people about this, we realised what a can of worms it is in terms of how much people are exposed to something versus how much they are vulnerable to the consequences of it.”

As an example, think about how a heatwave would affect someone in a country where air conditioning is commonly available, versus how it would affect someone in a country where it is not. A person in the latter country could be exposed to less extreme weather, while being much more impacted due to their vulnerability to it. “We felt that we were going to be sidetracked in trying to resolve those differences before we could do anything else,” says de Meyer. So it was put to one side. “We would like to come back to it with the experience that we've gathered later on.”

With the core metrics decided upon, the next step was to design the dashboard. This is where Data4Change stepped in -- a nonprofit that works with organisations tackling social, political and environmental issues to help them deliver greater impact. “They were asking questions that we would never have thought to ask,” says Hubble-Rose.

The Data4Change team got deep into the details, looking at the different metrics and how they could be clearly conveyed to a non-expert audience. “They would ask questions like, why do you sometimes use the word rate and why do you sometimes use the word speed? Are they the same thing or not?” says Freya Roberts, the project manager for the UCL Climate Action Unit. “They were spotting these moments of inconsistency or us flipping between a more scientific word and a more lay word and saying to us, ‘Why are you using both? Do they mean something different?’”

Another example was an analogy that was being used for the scale of the Earth’s energy imbalance. “It was comparing the total energy imbalance over the entire surface of the Earth to the amount of energy that humanity is using to do all of the things that we are doing. And it's a factor of about 30 difference,” says de Meyer. “The way we had explained that was really not clear. They were asking these really probing questions. And while trying to answer those questions we were like, ‘Ah, maybe we need to come up with a different analogy. Because there are far too many different ways that different audiences could be misinterpreting this, could be hearing something else than what we're trying to explain.’”
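
For readers who want to sanity-check that “factor of about 30”, here is a back-of-the-envelope version; the imbalance and energy-use numbers below are illustrative assumptions of my own, not values supplied by the project.

```python
# Back-of-the-envelope check of the "factor of about 30" claim (illustrative
# round numbers, not the project's own figures).
earth_surface_m2 = 5.1e14        # Earth's total surface area in square metres
imbalance_w_per_m2 = 1.0         # assumed energy imbalance, watts per square metre
human_energy_use_w = 18e12       # assumed global energy use, roughly 18 TW

total_imbalance_w = earth_surface_m2 * imbalance_w_per_m2  # ~5.1e14 W, i.e. ~510 TW
print(total_imbalance_w / human_energy_use_w)              # ~28, a factor of about 30
```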

The Data4Change team then held a two-and-a-half-day sprint in London, getting together creatives and data experts into three teams working on different prototypes. “The first and perhaps biggest challenge we faced was distilling years worth of scientific research into creative briefs,” says Stina Bäcker, co-founder and head of operations at Data4Change. “Since scientists and designers speak very different languages, and because climate science is such a nuanced and complex topic, it took a lot of effort to make sure we created briefing materials for the sprint that had all the scientific smarts but didn't make you feel like you needed a PhD in environmental science to understand.”

“We pride ourselves on being able to communicate complex concepts well,” says de Meyer, “but having to explain it to them in the first instance, when we started to work on those design briefs, really made us have to think really hard about how to simplify even further than we would normally do. That really brought us to a place where a lot of things that we thought were necessary to explain, actually weren't necessary to explain.”

Photo by UCL Climate Action Unit is licensed under CC BY 2.0.

After the prototypes were created, they were tested with the kinds of audiences who might actually use the product. “We really liked the ‘socialisation’ events that UCL’s Climate Action Unit put on for the intended target audiences which in this case were journalists, media networks and policymakers,” says Bäcker. “This comprehensive user testing led us to a solution that resonated with diverse audiences.”

These sessions weren’t just about feedback, though. They were also about making these targeted audiences aware of the project, giving them an early glimpse of how it works and the opportunity to shape that, and onboarding them on how to use it. “We will definitely try and emulate this for some of our projects in the future,” says Bäcker.

Once the designs had gone through user testing, Data4Change brought in the team at Italian design and innovation firm Accurat to fine-tune the designs into three clear, user-friendly and engaging visuals. “Having also participated in the sprint—with a representative in every team—they were adept at understanding the project nuances and goals,” says Bäcker. “This issue deeply resonates with them, and they committed significant resources to see the project to its completion.”

Photo by UCL Climate Action Unit is licensed under CC BY 2.0.

The team now has a static prototype of the dashboard, but there’s more work to be done before it can be formally launched. “We uncovered some data gaps that exist in terms of how soon or how quickly the data is made available,” says de Meyer. Unlike COVID data, which was updated on a more-or-less daily basis, climate data has a lag of a couple of months. “The data gets captured on a daily basis, it just doesn't get accumulated and integrated on a daily basis. That process takes a lot longer at the moment. We need to go to the providers of these datasets and we need to ask them what needs to happen on their end before that data can be made available.”

Then there’s that missing fourth metric, and perhaps even a fifth. “We didn't just develop the dashboard, we also developed a way to bring scientists and designers and journalists in conversation about changing how climate change is communicated,” says de Meyer. “That process can be helped with further development, which is best done through the development of other metrics. Like the impact metric, or an action-based metric that tells us something about how much progress we are making in tackling climate change.”

Photo by UCL Climate Action Unit is licensed under CC BY 2.0.

The team says it’s important, too, that further socialisation work is done - introducing it carefully to journalistic audiences beyond the climate or science desk. “While climate change is often confined to scientific or political discourse, it intersects with areas like sports, fashion, music, tech, and more,” says Bäcker. “By weaving a climate narrative into unconventional topics, journalists can capture diverse audiences' attention.”

“Equipping journalists with these metrics can foster more insightful discussions between them and climate scientists, catering to their specific reporting niches,” adds Bäcker. “By drawing clear connections between climate change and its direct impact, it's hoped that the public will be more inclined to hold power structures accountable, and also to take individual action on climate change. And it’s imperative not to focus on the doom and gloom, but on the tangible actions that we can still take to mitigate the effects of climate change.”

“I think of it as a smartphone,” says de Meyer. “Once the first smartphone came on the market, we didn't really know all the different ways that smartphones were going to change our lives, and these metrics are similar to that. We don't fully understand or can predict how they will change the way that we talk about climate change. But we think that they can.”


Data gives voice to women on climate change
https://datajournalism.com/read/longreads/data-gives-voice-to-women-on-climate-change
By Sherry Ricchiardi · Mon, 04 Sep 2023

Women in action

On a sunny afternoon in March, a group of elderly Swiss activists walked out of a courthouse in Strasbourg to a round of applause and chants of “bravo.” Supporters blew bubbles and rang cowbells to celebrate their landmark climate case. “They treated us like heroines,” said Elisabeth Stern, 75, who was part of the delegation representing the 2,400-strong Senior Women for Climate Protection, supported by Greenpeace Switzerland.

Theirs is the first climate case ever heard by the European Court of Human Rights (ECHR), a last resort when legal options run out at home -- Swiss judges refused to hear the case. A win in Strasbourg could set legal precedent for the Council of Europe’s 46 member states and beyond. The “Climate Grannies,” as CNN dubbed them, argued that their government is violating their fundamental right to health through “woefully inadequate efforts” to address global warming. They cited research showing that elderly women are more at risk of dying during heatwaves than men.

Switzerland is warming at more than twice the global rate, its glaciers are melting fast, and the country has been hit by extreme heat this summer. The climate activists – average age 75 – drew media coverage from as far away as the Times of India and South China Morning Post. A headline from the New York Times read, “Grandmothers of the World, Unite”. Deutsche Welle posted a video, “Climate Seniors are taking on Switzerland.”

At 75, Elisabeth Stern is taking the Swiss government to court over climate change with an association called Senior Women for Climate Protection. Source: © Joël Hunn/Greenpeace

“This is an historic moment. We see ourselves as tremendous agents of social change,” said Stern, a retired anthropologist who grew up in the shadow of the Alps. In July, she retreated to the mountains to escape the heat in Zurich. Women are in the eye of the storm with climate disasters. What price are they paying? Are journalists paying attention?

Climate change is not gender-neutral and should not be reported that way.

How women are affected

The U.N. states it bluntly: “Gender inequality coupled with the climate crisis is one of the greatest challenges of our time. It poses threats to ways of life, livelihoods, health, safety and security for women and girls around the world.”

From the New York Times: “Although climate change is a collective problem, its burdens — displacement, homelessness, poverty, sexual violence, disease — weigh more heavily on women and girls . . . 80 percent of people displaced by climate change are women.”

The Centre for Climate Justice based at Glasgow Caledonian University explains to the BBC that climate change affects women more for two reasons: “Extreme weather disasters intensify existing inequalities in society. Women don't have good enough representation at climate talks to have their say on effective solutions.”

What should data journalists take away from this?

  • Climate change is not gender-neutral and should not be reported that way
  • The impact of climate change amplifies gender inequalities, an important part of the story.
  • Sex-disaggregated data -- data collected and tabulated separately for women and men -- is vital for fair reporting on climate issues.

By segmenting data by gender, region, or marginalised groups, journalists can uncover inequalities and issues hidden in aggregated data. It becomes a matter of analysing and curating the numbers to tease out storylines. “It is critical in understanding what is truly happening. We have seen this time and again, where men are the ones who are reflected in the data, but when you disaggregate it, you get a very different picture,” said Fara Warner, Solutions Journalism Network's director of climate.
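
As a minimal illustration of what disaggregation looks like in practice, the Python sketch below compares an aggregate figure with the same figure split by sex. The dataset, file name and columns are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical survey of climate-disaster impacts (illustrative only).
df = pd.read_csv("disaster_impacts.csv")  # assumed columns: "sex", "displaced" (0/1)

# The aggregate figure hides any gender gap...
print("Overall displacement rate:", df["displaced"].mean())

# ...while disaggregating by sex can reveal a very different picture.
print(df.groupby("sex")["displaced"].mean())
```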

She advises reporters to “broaden and deepen” their sources, focusing on scientists and researchers who are working directly with communities most affected by climate change. “This is such a ripe topic to investigate. Add in solutions, and it becomes a place where any journalist can make a mark. This is especially true because the connection between gender and climate is under-covered in the media,” said Warner. That leads to the question: Who is working to change that?

Data shows that more women die prematurely than men due to environmental degradation.

Filling the gender data gap

U.N. statistician Iliana Vaca Trigo sees efforts being made to fill gender data voids. According to Trigo, the lack of segmented data “has been acknowledged in the gender-environment nexus” and action is being taken. She cited efforts by the U.N. to mainstream gender into climate change data. Trigo provided links to resources that could be helpful to journalists. They are listed in the resource section at the end of this article.

Along the same line, an Organization for Economic Co-operation and Development (OECD) report cited a “growing recognition of the need for a gender lens to understand the impact of environmental factors on well-being.”

OECD provides evidence and analysis on the gender-climate nexus, i.e., gendered impacts of climate change. For instance, data shows that more women die prematurely than men due to environmental degradation, face greater economic insecurity and are more often displaced by climate-forced migration.

Open-access data tools from Data2X are another valuable resource. A link to the World Bank Gender Data Portal, with more than 900 indicators in an accessible and usable format, is available on the website.

Also worth checking are the “Gender Data Solutions Inventory” and “Solutions to Close Gender Data Gaps,” which offer over 140 practical tips, including on the environment. The inventory includes the Gender Climate Tracker app, with information on policies, research, and actions related to gender and climate change.

These sources share a commonality. They provide sex-disaggregated data as a tool for fair and balanced reporting on climate and women.

“It allows us to see the full picture of climate change, revealing inequities and deprivations,” said Lindsay Bigda, communication manager for Women’s Environment and Development Organization (WEDO). “Although it has been neglected in the past, there are indications of a growing movement that indicates uptake in disaggregated data as we consider how crucial it is to produce comprehensive reporting and solutions,” she added.

A study by the International Union for Conservation of Nature (IUCN) illustrates Bigda’s point. Researchers used collated data and case studies from over 1,000 sources to document links between environmental pressures and violence against women and girls. Some of the findings were stunning.

From the report: “Conflict over access to scarce resources can give rise to practices such as ‘sex-for-fish’, where fishermen refuse to sell to women if they do not engage in sex,” a practice seen in parts of Eastern and Southern Africa.

The report linked gender-based violence to environmental crimes, such as sex exploitation around illegal mining. “The damage humanity is inflicting on nature can also fuel violence against women around the world – a link that has so far been largely ignored,” said IUCN acting director Grethel Aguilar when the study was released.

Women and children carry water in India. Due to a lack of piped water, many women in the global south have to fetch water from natural sources, walking kilometres and potentially exposing themselves to gender-based violence. (Photo: © Ray Witlin / World Bank)

At times, progress has been slow. The European Institute for Gender Equality (EIGE) monitors women in positions of power in environment and climate within the EU. In November 2021, they reported that women were “woefully under-represented” in EU member states.

The datasets, which date back to 2012, show where progress has – and has not – been made. The data can be found in the Women and men in decision-making entry point of EIGE’s Gender Statistics Database. EIGE has been tasked by the European Commission with cross-data collection and analysis on issues related to gender and environment, said Ligia Nobrega, EIGE expert for gender statistics.

She would like to see more reporting on the relevance of mainstreaming a gender perspective throughout all key sections related to the environment and climate change, such as energy, transport, industry, economy, and agriculture.

“You can also portray stories of women and men challenging gender stereotypes in climate-related sectors, or portray how women and men, in all their diversity, impact and are impacted by the climate differently,” said Nobrega.

Women are on the frontlines of the climate crisis. Who is telling their stories?

The storytellers

While covering the 26th UN Climate Change Conference of the Parties, COP26, in Glasgow, Washington Post climate reporter Kasha Patel heard young women describe how extreme weather closed or destroyed their schools, deepening inequalities in education. Research shows girls are less likely to return to the classroom after a crisis.

“We often talk about the effects of climate change on our atmosphere, but this discussion showed it can also indirectly affect social issues and become a ‘threat multiplier,’” said Patel. She reported, “More extreme weather is taking girls out of school, forcing them into earlier marriages and increasing their exposure to violence.”

Patel used data to show how girls in low-income countries are disproportionately impacted by climate change. Some of the findings were jaw-dropping.

According to research from the Malala Fund, in 2021 climate-related events prevented at least 4 million girls in poor countries from finishing their education. By 2025, that number could climb to more than 12.5 million due to flooding, droughts, and increased exposure to disease.

A UNICEF study calculated the number of hours women and girls spend collecting water each day – 200 million, equaling 8.3 million days or over 22,800 years. If water is contaminated or dried up due to droughts, they are forced to walk longer distances. “When water is inaccessible, they cannot practice proper menstrual hygiene and miss class until their period is over,” wrote Patel. “If extreme weather damages crops, girls may skip school to spend more time in the field to make up for losses.”
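
Those conversions hold up under a quick arithmetic check, using only the article's own 200 million figure:

```python
hours = 200_000_000          # hours spent collecting water each day (UNICEF figure)
days = hours / 24            # ~8.3 million days
years = days / 365           # ~22,800 years
print(f"{days:,.0f} days, or about {years:,.0f} years")
```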

She cited U.N. data projecting rises in global temperatures and increases in heavy precipitation, placing women at greater risk. “These types of stories show the lasting effect that climate change will have on generations, setting back social progress in many regions of the world,” Patel wrote in the Post.

Malala Yousafzai is the youngest Nobel Prize winner and an activist for girls' education. Photo by Southbank Centre London is licensed under CC BY 2.0.

Case in point, Reuters reported from Jacobabad, Pakistan, the “hottest city on earth,” on how mothers and pregnant women were surviving.

The story cited an analysis of 70 studies that found pregnant women exposed to heat for prolonged periods are at higher risk of complications. Data from the meta-analysis, gathered by several international research institutions and published in the British Medical Journal, showed that for every one degree Celsius of temperature rise, the number of stillbirths and premature deliveries increased by 5 percent.

On another front, a story by the Fuller Project, published in the Washington Post, reported a link between violence against women and weather disasters.

A study in Kenya, based on satellite and national health survey data, showed domestic violence rose by 60 percent in areas that experienced severe weather. That analysis, and 40 others published as part of a global review in the journal The Lancet, found similar results.

“Heat waves, floods and climate-induced disasters increase sexual harassment, mental and physical abuse and femicide, reduce economic and educational opportunity, and increase the risk of trafficking due to forced migration,” Terry McGovern, head of Columbia University’s department of Population and Family Health, told Fuller Project reporters.

This kind of journalism raises awareness and ensures that the plight of women will not be forgotten. That is particularly important since government statistics often fall short.

What governments are missing

In November 2022, world leaders at COP27 lined up for the obligatory group photo. Of the 110, only seven were women, the lowest concentration at a U.N. climate summit, according to the Women’s Environment and Development Organization (WEDO), which tracks female participation.

World leaders pose for a photo during the Climate Change Implementation Summit of the UNFCCC in Sharm El-Sheikh. Photo by UN Climate Change is licensed under CC BY 2.0.

An analysis of climate change stories during COP26 found only 2% contained a gender angle. The proportion of stories on gender-related issues declined from 2.2% in 2017 to 1.4% in 2021. A study by Data2X found that 60 percent of countries don’t have data on how women are affected by climate change, and no globally agreed-upon data framework exists to monitor gender and climate.

Women typically are not part of governmental policymaking on the environment or included in decision-making bodies. According to Oxfam, women make up less than 24% of the world’s parliamentarians and 5% of its mayors.

Yet, the notion of women as climate change agents is gaining momentum. COP27’s Gender Day pushed for integrating female perspectives into climate policies, strategies and financial decisions. Women were heralded as key drivers of climate solutions. Climate stories often portray women as victims or survivors, which they frequently are. But they also have another role: agents of change.

Women as change agents

A Glasgow-based programme called Gilded Lily runs workshops to empower women to speak out and become engaged with climate change. Zarina Ahmad, one of the leaders, sets a defiant tone: “Look at women for solutions and resilience -- and don't speak on behalf of women, which is what we often get, especially women of colour. Give us space, let us have our voices, and let us be heard.”

Ahmad’s words ring true for Las Chelemeras, a group of women in the Mexican port town of Chelem who have restored more than 60 percent of the state reserve of swamps and mangroves on the north coast of Yucatan.

A story in El Pais International described them as local heroes, breathing new life into Yucatan’s mangrove forests, which act as a natural buffer against the effects of climate change. The muddy soil that mangroves live in is carbon-rich, and their strong root systems help protect coastal communities from extreme weather events like hurricanes.

El Pais reported, “before entering the swamps, the voice of these women went unheard.” Today, their project helps to recover ecosystems that had been lost due to environmental degradation and urban development.

Above is a screenshot of the El Pais International article showing how a group of 18 women of Mayan origin from the same fishing village have spent more than a decade journeying into the swamps, protecting and regenerating various ecosystems.

A CNN report, “Solar Sisters to Waste Warriors”, highlighted a project in Tanzania where women learn to install solar equipment and build sustainable gas systems to reduce dependence on firewood for cooking and lighting homes.

For a UK-led initiative called eXXpedition, 300 women from 100 countries sailed to remote areas to combat plastic pollution. The all-female crews collect samples of water, sand and air and analyse how they have been contaminated by plastic waste, reported CNN.

When it comes to women combating climate, there is no shortage of stories waiting to be told.

Look at women for solutions and resilience, and don't speak on behalf of women.

Solutions Journalism Network (SJN) provides guidelines for conducting investigations “to hold accountable solutions put forth by governments, businesses, nonprofits, and other stakeholders.” They relate to reporting on women and climate, such as:

  • What is the solution being put forward, and how has that solution worked or not worked?
  • What evidence, such as on-the-ground experience or empirical data, indicates that the solution is effective?
  • Has it been effective in communities most affected by the climate crisis?
  • What insights and information could help other stakeholders better respond to the problem?

SJN’s Solutions Story Tracker, a curated database of solutions stories, includes hundreds focused on climate. How many of these articles could have included a gender angle but missed the boat?

Coveringclimatenow.org (CCN), another excellent source for solutions reporting, includes sources for story ideas and a climate solutions guide.

Among CCN’s best practices for climate journalists:

  • Know your audience
  • Humanise and localise the story
  • Know the science, but talk like a real person
  • Tell the whole story, including solutions
  • Do not give a platform to climate deniers
  • Remember, climate is a story for every beat

The bottom line: journalists are at the heart of solutions reporting. Whether writing about 150-degree temperatures in the Middle East or droughts in Africa, they are well-positioned to address deep-rooted climate inequalities around the globe. That is part of their watchdog role.

Solutions Journalism Network's Fara Warner believes, “We will be held accountable for how we cover this climate crisis. We can do better.” Her words have a ring of truth as global warming tightens its grip.

Resources for stories on women and climate

Kill switch: reporting on and during internet shutdowns
https://datajournalism.com/read/longreads/internet-shutdowns-data-reporting
By Sabrina Faramarzi · Wed, 08 Feb 2023

On the 16th of September 2022, Iranian woman Mahsa Amini died in a hospital in Tehran. Her death became a catalyst for protests across Iran and has been cited as “Iran’s George Floyd moment”. To quell the outrage, the Iranian government carried out a number of internet shutdowns in several areas of the country and blocked popular mobile messaging apps.

This isn’t the first time Iran has imposed internet shutdowns and censorship - the country has a long history of them - nor is it the only country in the world to use such techniques. Shutdowns are emerging as a common practice, used by governments across the world to suppress dissent.

Data shows that shutdowns and applications of censorship are becoming more and more common across the world. In just the first half of 2022, there was a 22% increase in shutdowns compared to the previous year, impacting 1.89 billion citizens globally.

Because internet shutdowns exist on a spectrum, they can be difficult to pinpoint, verify, understand and, as a result, report on. This article offers a summary for data journalists of how these shutdowns happen, how to investigate them and how to begin measuring them, so we can better report on them.

What are internet shutdowns?

Internet shutdowns are a tool for information control. The practice of shutting down and censoring the internet is emerging as a key human rights issue, violating citizens’ freedom of expression, right to assembly and access to information. A 2021 report by Google’s Jigsaw project found that internet shutdowns are increasing at an ‘exponential’ rate, ‘threatening civil society’. But how do we define a shutdown?

According to Access Now, an international human rights organisation founded in 2009 as a response to the Iranian election that year, an internet shutdown is “an intentional disruption of internet or electronic communications, rendering them inaccessible or effectively unusable, for a specific population or within a location, often to exert control over the flow of information.” This can happen on a countrywide level, in specific regions or for specific networks. Shutdowns can happen for any period of time - a few hours, a couple of weeks, or even several months. The pro-democracy protests in Hong Kong in 2019 led the Chinese government to impose a new national security law, which gave it the power to shut down internet services in the city, censor online content, and arrest individuals for posting or spreading “fake news” or “hate speech” on social media platforms. And in Tigray, Ethiopia, citizens have been experiencing an internet shutdown that has lasted more than two years - one of the world’s longest - silencing the voices of over six million people.

“Things are kind of coming full circle unfortunately, but we were basically founded as an emergency response tool,” says Zach Rosson, Data Analyst for the #KeepItOn campaign at Access Now. Access Now’s #KeepItOn campaign is an international association of digital advocacy groups that reports on the digital rights of users at risk around the world. They have a 24/7 hotline where citizens, civil rights groups, activists and journalists can call in if they are experiencing an internet shutdown, as well as handbooks, digital safety tips and other resources to help people stay on top of their digital right to internet access. “Given how entrenched the internet is for increasingly larger percentages of the population around the world, there's a fundamental human rights aspect when it comes to shutting down the internet.”

However, when it comes to internet censorship, it is about the blocking of “specific websites and applications,” says Maria Xynou, internet censorship researcher & community lead at OONI (Open Observatory of Network Interference), a global community that has been measuring Internet censorship since 2012. “The main difference between shutdown and censorship is basically that with censorship, we're referring to the targeted shutdown of specific services.” (To learn more about internet censorship, OONI’s Maria Xynou has a free course online at Advocacy Assembly).

The impact of shutdowns

Internet shutdowns can affect so many areas of life - and livelihood. In Iran, internet disruptions and mobile outages or restrictions cost the country $37 million a day, according to internet monitoring group NetBlocks. This is likely why Iran has decided to target certain apps rather than repeat the sweeping blackouts imposed during the previous swell of protests back in 2019.

So why should journalists care? Because government-instigated internet shutdowns are a violation of citizens’ democratic rights. They are happening more often, lasting longer, becoming more sophisticated and harder to detect. The number of global shutdowns remains high, with 196 documented incidents in 2018, 213 in 2019, and 155 in 2020, according to a paper from the Carnegie Endowment for International Peace.

Cutting access to stifle public dissent is an increasing global trend that affects journalists and activists significantly, among others. In its 2022 report, the UN Human Rights Office warns that internet shutdowns have a ‘dramatic’ impact on people's lives and human rights as access to the internet is a democratic right.

At a very basic level, internet shutdowns harm journalists’ capacity to do their job in the first place: connecting with sources, conducting research and publishing articles. In many countries, however, the interference is less blatant. Russia, for example, has been blocking independent media websites and blogs, effectively silencing independent reporting. “In order to be able to defend human rights on the internet and to be able to ensure that our human rights are actually protected in this online world, it is necessary to be able to have transparency of any controls that are implemented,” says Xynou.

She goes on to explain that before the new wave of dedicated tools and organisations committed to measuring and publishing internet shutdowns, reports were mostly anecdotal. “There are so many reasons why services may be inaccessible, which may have nothing to do with intentional government censorship,” she says. “So distinguishing accidental inaccessibility versus intentional government blocking is actually a difficult problem, and it’s definitely not something that is obvious to a normal internet user.”

Xynou explains that measurement tools are so key because governments that have purposefully turned off internet access can otherwise seek plausible deniability, making it more difficult for journalists to hold them to account. “If there's no data that can serve as evidence that they intentionally blocked the service, governments can deny it,” says Xynou. This is why it’s crucial for data journalists to be involved in reporting on shutdowns. Only with data can journalists prove that a government-instigated shutdown is happening, and the data lets them not only report on the shutdown itself but also use it as a starting point for further stories. “For example, in countries where LGBTQ websites are blocked, there is likely little transparency or public debate about that in the country, because LGBTQ rights are not recognised or not really protected,” says Xynou. “But having the data can serve as evidence about exactly which specific LGBTQ communities are being affected.” She explains that this is important because coverage of internet censorship often focuses only on when major social platforms - like WhatsApp or Facebook - get blocked, because they have such a huge user base.

But the biggest risks are when platforms that serve marginalised communities get blocked - which often go unnoticed. “Platforms run by and for marginalised communities are at higher risk of being censored but it's not easy for these cases to receive attention and reporting,” says Xynou. “It is these sorts of cases that you can uncover and report on through censorship measurements.”

Understanding and measuring shutdowns, blocks and censorship

There are many ways for people to push back against internet blackouts and network disruptions, but journalists first need to understand them in order to report on them. Because the internet is a network of networks - not a single one - there are many ways in which a shutdown can occur, and knowing how the internet works is a prerequisite for knowing how to measure it. “The internet was not made to be measured,” says Amanda Meng, research scientist at Georgia Tech. “I think that context is really helpful for people just to understand the kind of hairiness or complexity to it in the first instance.”

Meng is part of IODA (Internet Outage Detection and Analysis), a research group and tool that monitors internet outages in near-real time. Her colleague Zachary Bischof, a research scientist whose work focuses on geolocating and mapping internet outages, explains that even with these tools it can be difficult to be specific, which is why getting the data is just the first step.

“[Regarding Ukraine], if there's damage to some infrastructure, we can say, oh, that connectivity is reduced to say 50% or something. That's a partial outage, but then it might level off there, and then maybe there's another drop later,” says Bischof. He explains that it is sometimes difficult to pinpoint exactly when an outage began and ended, or to determine whether it was government-imposed, which is why journalists need to verify the information with other sources, too.

But there are ways to anticipate outages during certain events where governments may impose a shutdown. Shutdowns often happen during six types of events:

  • Mass demonstrations
  • Military operations and coups
  • Elections
  • Communal violence
  • Religious holidays
  • School exams

All of these are events during which governments may want to restrict the flow of information, the ability to organise and freedom of speech.

There are many types of internet blocks (and they are defined differently depending on where you look), but generally there are three main forms of internet blocking:

  • Full, blanket shutdown: This is where governments use the ‘kill switch’ to shut off all access to the internet in a country or region. This is the most drastic of tactics and can happen by completely shutting off network services, though the process is complicated. In cases like these, VPNs and other tools used to circumvent internet blocks cannot be used.

  • Platform-specific blocking: Often known as partial shutdowns, platform blocking targets specific sites and apps while the rest of the internet remains usable. The targets are usually social media apps (like Instagram, Telegram, WhatsApp, Twitter and Facebook) that governments want to close off to stop people from sharing content and organising.

  • Bandwidth throttling: The deliberate slowing down of the internet until it becomes effectively unusable. Bandwidth is ‘the maximum amount of data transmitted over an internet connection in a given amount of time’, so connections with higher bandwidth can send data faster than those with lower bandwidth. Throttling limits the amount of data that can be sent, reducing transfers to very low speeds and hitting multimedia hardest - the photographs, videos and live streams people typically want to share from protests, for example. It can be implemented through Internet Service Providers (ISPs). Access Now has a handy report on the taxonomy of network interference. (A rough sketch of how throttling might be detected follows below.)
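How might a reporter spot throttling in practice? Here is a minimal Python sketch of one approach: repeatedly time the download of a fixed-size file and watch for sustained drops in throughput. The test URL is a placeholder, the threshold is arbitrary, and a slow reading is only a clue - never proof on its own - that a link is being throttled.

    # Minimal throttling-detection sketch. Requires the third-party
    # "requests" library (pip install requests). The test URL is a
    # hypothetical placeholder; any reliably hosted file would do.
    import time
    import requests

    TEST_URL = "https://example.com/1mb-test-file.bin"  # placeholder

    def measure_throughput_kbps(url: str) -> float:
        """Download the file once; return throughput in kilobits per second."""
        start = time.monotonic()
        total_bytes = 0
        with requests.get(url, stream=True, timeout=30) as response:
            response.raise_for_status()
            for chunk in response.iter_content(chunk_size=64 * 1024):
                total_bytes += len(chunk)
        elapsed = time.monotonic() - start
        return (total_bytes * 8 / 1000) / elapsed

    if __name__ == "__main__":
        # Take a handful of samples; a steep, sustained fall-off relative
        # to the first reading hints (but does not prove) throttling.
        samples = [measure_throughput_kbps(TEST_URL) for _ in range(5)]
        baseline = samples[0]
        for i, kbps in enumerate(samples, start=1):
            flag = "  <- much slower than baseline" if kbps < 0.2 * baseline else ""
            print(f"run {i}: {kbps:,.0f} kbit/s{flag}")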

But there are many other ways in which governments can create internet blocks, ones that are much more subtle and harder to verify:

  • Mobile Data Shutoff: This is when governments shut off mobile data so the internet cannot be accessed on portable devices. It is a common tactic in countries where incomes are low, access to computers is limited and smartphone ownership is widespread, and it often achieves a near-complete shutdown without the complications of a full kill switch.

  • DNS Interference: When governments tamper with the domain name system (DNS), a website you are trying to reach either resolves to the wrong site or returns an error message. (See the sketch after this list for one way to check for this.)

  • Denial of Service: This is when so many requests are sent to a specific website or app that it slows down or crashes. Accessing the service through a VPN routed via another country may still work in this case.

  • IP Blocking: Devices and servers connected to the internet have unique identifiers called IP addresses. Internet Service Providers can use these addresses to block services very precisely in specific regions, while leaving others intact.
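As a rough illustration of how DNS interference might be checked, the Python sketch below resolves the same domain through the system's default resolver (often the local ISP's) and through an independent public resolver, then compares the answers. It assumes the third-party dnspython package, and a mismatch is only a hint - content delivery networks legitimately return different addresses by region - so any result needs corroboration.

    # DNS interference check sketch. Requires dnspython (pip install dnspython).
    import socket
    import dns.resolver

    def resolve_via_system(domain: str) -> set[str]:
        """Ask the operating system's default resolver."""
        try:
            return set(socket.gethostbyname_ex(domain)[2])
        except socket.gaierror:
            return set()  # resolution failed entirely

    def resolve_via_public(domain: str, nameserver: str = "9.9.9.9") -> set[str]:
        """Ask an independent public resolver directly."""
        resolver = dns.resolver.Resolver()
        resolver.nameservers = [nameserver]
        try:
            return {rr.to_text() for rr in resolver.resolve(domain, "A")}
        except dns.exception.DNSException:
            return set()

    if __name__ == "__main__":
        domain = "example.org"
        local, public = resolve_via_system(domain), resolve_via_public(domain)
        print("system resolver:", local or "no answer")
        print("public resolver:", public or "no answer")
        if local != public:
            print("Answers differ - possible DNS interference; verify further.")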

Identifying and verifying shutdowns

But how can journalists verify that a shutdown is government-instigated? And where should they start if they suspect an internet block is taking place? There are some key online tools, dashboards and organisations that track and measure shutdowns, collecting and publishing the evidence.

Here are just a few:

Launched in 2017, The Icarus Project is a technical research laboratory dedicated to testing, analysing, documenting and developing internet censorship circumvention solutions. The project collects as many techniques as possible and documents them in step-by-step guides, and it is currently documenting censorship circumvention techniques in the Egyptian context to make them more broadly accessible.

Access Now is a non-profit organisation founded in 2009 with a mission to defend and extend the digital civil rights of people, providing an index of internet shutdowns around the world.

OONI’s tool, the OONI Probe, allows users to measure internet blockages and online censorship on their own networks. The OONI Explorer dashboard, alongside the OONI Measurement Aggregation Toolkit (MAT), opens up a huge, regularly updated dataset of those measurements.
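For journalists who want to work with OONI's data directly, measurements can also be pulled programmatically. The Python sketch below queries OONI's public measurements API; the endpoint and parameters shown reflect the API as publicly documented at the time of writing, so check OONI's own documentation before relying on them.

    # Sketch of querying OONI's public measurements API.
    # Requires the third-party "requests" library.
    import requests

    API_URL = "https://api.ooni.io/api/v1/measurements"

    def confirmed_blocks(country_code: str, domain: str, limit: int = 10) -> list[dict]:
        """Fetch recent measurements OONI has flagged as confirmed blocking."""
        params = {
            "probe_cc": country_code,  # two-letter country code, e.g. "IR"
            "domain": domain,
            "confirmed": "true",
            "limit": limit,
        }
        response = requests.get(API_URL, params=params, timeout=30)
        response.raise_for_status()
        return response.json().get("results", [])

    if __name__ == "__main__":
        for m in confirmed_blocks("IR", "www.bbc.com"):
            print(m.get("measurement_start_time"), m.get("probe_cc"), m.get("input"))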

IODA (Internet Outage Detection and Analysis) is a project from the Georgia Institute of Technology: a prototype system that monitors internet outages in near-real time. It was launched in 2011 by Alberto Dainotti as a way to measure the shutdowns in Egypt and Libya during the Arab Spring, and it is both a usable tool and a research project.

Censored Planet is an internet censorship observatory: a censorship measurement platform that collects data using multiple remote measurement techniques in more than 200 countries. Its methods mean it doesn’t have to rely on accessible vantage points or volunteers in different countries, ‘surpassing scale, coverage, continuity, and safety limitations’.

Launched in 2010, the Google Transparency Report shares data on the actions of governments and corporations affecting privacy, security, and access to information online. They log the number of visits to every Google product in real-time, along with an approximation of the geographic region where the visit originated. This means that journalists and other users can check a decrease in traffic in a specific region, which may mean that users there cannot access a product or service. They also publish reports on content removals and requests from governments around the world.

Reporting during a shutdown

Internet shutdowns can happen anywhere, and are not unheard of in the western world either. The first known instance of a government-instigated internet outage in the US happened on August 11, 2011, during protests over a fatal police shooting of a passenger: the government agency in charge of San Francisco’s subway system, BART, shut down mobile service at four stations to suppress the demonstrations.

But what is it like to report during a shutdown? As of 2021, India had recorded the world’s highest number of internet shutdowns for four years straight, including a particularly dangerous shutdown during the coronavirus pandemic. Safina Nabi, a freelance multimedia journalist based in the Indian part of Kashmir, explains what it was like to report from there during the Indian government’s shutdown, which ran from August 4, 2019 to March 4, 2020 and was deemed ‘the longest internet shutdown on record in a democracy’.

“I was among few women journalists on ground who were reporting and it was really difficult because initially we didn't know what was happening,” says Nabi. “Nothing was working.” There was also no transportation, so Nabi and other journalists would walk to reach their sources and colleagues in order to report.

The shutdown was a response to anticipated unrest over India’s decision to revoke the special status of its portion of Kashmir, known as Jammu and Kashmir, and fully integrate its only Muslim-majority region with the rest of the country. The initial shutdown was not limited to the internet: mobile phone networks, landlines, cable and television channels were all cut. For 72 days, the people of Indian-administered Kashmir were completely disconnected from all communications, until telephone lines were finally reinstated.

The internet was later turned back on, but at very slow speeds and with limited access; it was not until February 5, 2021 that 4G mobile data services were reinstated. Before then, many journalists were making weekly trips to their offices in Delhi to deliver their reports. But as a freelance journalist, Nabi did not have the resources to fly to Delhi each week to hand in her stories. She would visit the airport in Kashmir, try to find someone travelling out, hand them a USB drive with her stories and ask them to email the documents on it once they landed somewhere connected to the internet. “It was not easy, but there were lots of strangers willing to help each other,” says Nabi. “It’s not like they were putting a communication bar on me - they were putting a communication bar on 4 million people, and those 4 million people were against India and India's concept of democracy.”

After some weeks, international pressure meant that some communications were reinstated. A media centre was later opened, with - according to Nabi - five or six desktop computers running on a 2G network serving approximately 300 journalists, of whom only around eight, Nabi among them, were women. It was crowded, and each journalist usually had only a 10-minute window to use a computer and the internet to submit their stories. “Journalists would wait in line for hours,” says Nabi.

Even after 2G connections were reinstated, the internet was generally very slow, and because all the major social media sites were blocked, citizens started using VPNs (virtual private networks). “Once the government authorities [realised people] were using VPNs to share their feelings, they announced that anybody who is using a VPN will be jailed for six months straight without bail.” In Iran, citizens have similarly turned to circumvention tools such as Snowflake, a system from the Tor Project that routes traffic through volunteers’ browsers to bypass censorship.

The shutdown in Kashmir also coincided with the coronavirus crisis, when people relied heavily on the internet for information about the disease. “We were pushed into a black hole where we had no access to information. Information that could have saved you or killed you,” says Nabi. “As journalists we have to advocate for internet access. When the government shuts down the internet, they are shutting down a person's basic human right to reach out to another person.”

Collaboration and educating citizens

More and more, shutdowns are spilling over borders, which inevitably requires journalists to lean on colleagues in neighbouring countries - and that necessity can become a great way to collaborate and verify stories.

The actions of one country, in other words, rarely stop at its borders: global interconnectedness means that internet shutdowns can have huge ripple effects in all areas of life.

Access Now published a guide to the countries that hit the kill switch most often in 2021, as well as an annual Election Watch. With shutdowns becoming an increasingly common tactic for governments, journalists reporting on and in these countries need to understand how they work - an understanding that will be key to their reporting in 2023.

But should journalists offer more ways for citizens to understand internet shutdowns? “I think journalists play a very, very critical role in informing the public about internet censorship and increasing awareness that this is an issue that almost every country in the world experiences,” says Xynou. She explains that internet shutdowns and censorship also change over time, not only in how the blocks are implemented but also in whom they impact. “That governments around the world have the ability to censor and control what is accessed on the internet in itself is concerning, because in order to ensure that what they're censoring doesn't impact human rights and doesn't impact independent journalism, there needs to be transparency.”

Xynou believes it’s key that journalists stay on top of what is being censored and relay that to the public, because many blocks - particularly those that impact vulnerable communities - are often completely missed. “The internet is part of our world, so I believe it is a journalist's responsibility to monitor what governments control on the internet, just as they feel that they have a responsibility to monitor how governments exercise control of society.”

]]>
Data drives media coverage of climate refugees https://datajournalism.com/read/longreads/data-coverage-of-climate-refugees Tue, 13 Dec 2022 09:30:00 +0100 Sherry Ricchiardi https://datajournalism.com/read/longreads/data-coverage-of-climate-refugees Data has become a springboard for journalists on the frontlines of the climate refugee crisis. It points them to weather emergencies in hot zones like South Asia and Central America and to humans facing misery and despair.

Jorge A., a Guatemalan farmer, lost his corn crop to floods. He planted okra, but a drought killed it off. He feared if he didn’t get his family out, they, too, might die.

Jorge’s story was told in gripping detail in a data-driven investigation by ProPublica in partnership with The New York Times Magazine, exploring how changes in population patterns could lead to catastrophe. The series, “The Great Climate Migration Has Begun,” presented as a visual essay, cited scenarios of how this crisis might play out.

The joint venture, supported by the Pulitzer Center, had an over-arching strategy: to model, for the first time, how climate refugees might move across international borders. The modeling informed the journalists’ findings and “possible general pathways for the future.”

“Should the flight away from hot climates reach the scale that current research suggests is likely, it will amount to a vast remapping of the world’s population,” wrote ProPublica’s Abrahm Lustgarten, lead author for the 2020 series.

Climate scientists have sounded the alarm for decades.

ALTA VERAPAZ, GUATEMALA. Carlos Tiul, an Indigenous farmer whose maize crop has failed, with his children. Photo by Meridith Kohut

ProPublica and the Times provided a glimpse into the future. The three-part series explored how climate migration could spark massive population shifts and, in the process, remake the world order. According to the reporting, no country stands to gain more from the climate crisis than Russia. Climate maps showed a transformed United States.

“The issue of climate-induced migration is all encompassing. It will affect everything. The cost of resisting the new climate reality is mounting,” said Lustgarten, ProPublica’s senior environmental reporter. He is working on a book about how climate migration could reshape America.

How the Model Was Created

The project team turned to a “supercomputer” housed in a U.S. government facility in Cheyenne, Wyo., to process more than 10 billion data points into their model. It took four days for the machine, run by the National Center for Atmospheric Research, to calculate the answers.

They contracted with Bryan Jones, an expert in modeling at the City University of New York, to build a climate migration model for the project similar to one he created for the World Bank’s Groundswell report, which included policy recommendations to help slow the factors driving climate migration.

“The models are a matrix of a number of scenarios and variables, some local and some customized for us... It includes all the standard climate forecast models, and all the standard SSP development scenario models used by the U.N., plus a bunch of added datasets I introduced, like for water availability, crop yields and growing seasons,” explained Lustgarten. (SSP refers to Shared Socioeconomic Pathways.)

The model in part one of the migration series focused on Central America and Mexico. Lustgarten noted, “This is a global problem. There are hotspots: North Africa, South Asia, and Central America. I chose Central America because it also borders the US and was newsworthy because we have an immigration debate, and caravans of migrants were coming at the time.”

The next two articles were based on a different data approach sourced from Rhodium Group, which provides research, data and analytics on global topics.

EL PASO. A mother and daughter from Central America, hoping for asylum, turning themselves in to Border Patrol agents. Photo by Meridith Kohut

The team turned to top climate scientists for peer reviews and critique of the modeling approach. The goal was not to provide concrete predictions, but to show what the future might hold.

“Our model offers something far more potentially valuable to policy makers: a detailed look at the staggering human suffering that will be inflicted if countries shut their doors,” said Lustgarten. He predicts the impact of climate change “almost certainly will be the greatest wave of global migration the world has seen.”

Is the press corps prepared to meet that challenge?

Redefining the Concept of Objectivity

Journalists have taken a stand on how they cover the climate beat. Their view of what constitutes a “balanced news report” has shifted from “he said, she said” objectivity toward a “weight of evidence” approach. Mainstream media are giving climate skeptics less airtime, and for good reason.

Researchers have long raised concerns that the media distort the scientific consensus on climate change through “false balance” reporting, or “bothsidesism,” giving climate deniers too much say. Research by Northwestern University psychology professor David Rapp sheds light on the controversy.

In a co-authored study, Rapp and his colleagues conducted experiments to test how people respond when two views about climate change are presented as equally valid, even though one side is based on scientific consensus and the other on denial. Among the conclusions: “When both sides of an argument are presented, people tend to have lower estimates about scientific consensus and seem to be less likely to believe climate change is something to worry about.” A campus publication touted, “Northwestern research finds ‘bothsidesism’ in journalism undermines science.”

“The most important finding, to my mind, is that exposure to unsubstantiated viewpoints, pitched as reasonable alternatives, can be problematic. Making two sides appear to hold analogous evidence and support, when they do not, creates a real sense of false equivalency,” said Rapp, who researches language, memory and why people are so susceptible to misinformation.

His suggestion to journalists: “Contemplate providing a clear indication and detail as to the expert consensus underlying debated viewpoints, rather than just presenting those viewpoints on their own. For example, offering statements from people who hold the view that climate change is not something to worry about could benefit from also indicating that such a view runs counter to the consensus view of climate scientists who study these issues and have collected vast amounts of data on the topic.”

Courtney Perkins, a senior writer for CNN International, produced a study along those same lines. Her research on redefining balance concluded: “Journalists are largely abandoning the ‘both-sides’ method of covering the environment to protect their stories’ accuracy.”

Perkins, a University of Nebraska master’s candidate, added: “It is up to us, the world’s communicators, to convey the seriousness of climate change to the public in hopes of spurring action to address existential environmental threats.” Those words ring true as migration draws more media attention.

Expanding Climate Coverage

Three data journalists are on the Associated Press’ 20-person climate team, created by the wire service earlier this year. Two dig through statistics in search of stories; the third works on visualization. “We use a lot of data in our stories. I will give you two examples,” said Peter Prengaman, an AP veteran tapped to lead the new initiative.

The first, “Fight over human harm, huge climate costs,” reported on loss and damage caused by climate change. A bar chart identified the 20 countries that have done the most damage. The United States, China, European Union and Russia topped the list.

The second article analyzed electricity disturbance data submitted by utilities to the U.S. Department of Energy to identify weather-related outages in the United States as climate catastrophes spread.

“Climate change intersects with all aspects of life. If it worsens – and it is getting worse as the planet heats up – there will be more climate disasters. We felt we really needed to ramp up our coverage,” said Prengaman, AP’s global climate and environment news director. Six more journalists will join the team in 2023.

AP combines data and storytelling for an ongoing series about people uprooted by weather around the globe. In September, a report described the plight of a Kenyan woman attacked by a crocodile in a lake near her home, an attack that left one of her legs “nothing but bones and hanging flesh.”

Due to heavy rainfall tied to climate change “the expanding lake has swallowed up homes and hotels and brought in crocodiles and hippos that have turned up on people’s doorsteps and in classrooms,” according to the report. The woman was washing in shallow water after a day in the maize fields when the crocodile grabbed her.

The journalists found data to support what they suspected. This was not an isolated incident. A huge jump in crocodile attacks was tied to weather changes, said co-author Julie Watson.

Winnie Keben stands in her homestead as the sun sets at Meisori village in Baringo County, Kenya, Wednesday, July 20, 2022. (AP Photo/Brian Inganga)

Finding the statistics is only the first step. “It will get you on your way, but the story will be dry if you stop there,” said the veteran AP reporter. “If the data is to mean something, you need to humanize and make it real for people. It is important to find an example of what the numbers are saying, then really drill down as far as you can.”

Following are three other media organizations that recently expanded climate coverage:

  • In November, the Washington Post announced it was tripling the size of the climate team to over 30 journalists and adding “Climate Lab,” a section that uses data and graphics to tell stories. There also is a site for climate solutions. The Post won a Pulitzer Prize in 2020 for explanatory reporting on global warming.

  • Deutsche Welle recently partnered with Covering Climate Now, a news collaboration of 500-plus news outlets, to expand coverage. “The climate crisis is one of DW’s focus topics. By joining Covering Climate Now, we can collaborate with partner newsrooms around the world to ensure the issue is given the emphasis it needs,” DW said in a press release.

  • In October, National Public Radio created a new climate desk to cover “what might be the most important story of our time.” The supervising editor of the network’s Energy and Environment collaborative joined the team along with four climate journalists. Two reporters are assigned to explanatory journalism, helping the public understand changes to the planet.

Climate journalism has turned a corner. Environment coverage is woven into every newsroom beat, there is more investment in climate projects, and there is greater demand for specialists in the field. From a Washington Post ad: “Seeking two reporters to serve as global climate correspondents, new positions at the heart of expansion of climate coverage.”

One thing remains unchanged: “It is important to remember that real people live the issues we report about. They should always have a voice. That is fundamental to our job,” said ProPublica’s Lustgarten.


]]>
How to create data visualisations that serve the public https://datajournalism.com/read/longreads/accessible-data-visualisation Thu, 08 Dec 2022 11:00:00 +0100 Emilia Ruzicka https://datajournalism.com/read/longreads/accessible-data-visualisation A core tenet of journalism is to provide truthful, factual information in order to hold those in power to account. Data journalism is no different, though its practitioners may use a different set of tools to collect, investigate, and express that information. In addition to traditional written stories, data journalists have a wide range of demonstrative forms at their disposal, including static visualisations, interactive apps, and data tables.

Though each of these additional storytelling formats can immeasurably enhance a story, they also present barriers to the audience. Not only is it possible that readers cannot physically view a data visualisation due to visual impairments, but it’s also important to consider that users may simply be unfamiliar with how to read and digest such assets.

In an effort to make journalistic content as widely read as possible, it is essential to take steps towards more accessible data visualisation practices. Whether accommodating readers who cannot see the work in the same way the journalist does or adjusting conventions to ensure the reader is adequately guided through the information, making data visualisations more accessible enhances the audience experience for all those who interact with the work.

How do you make data visualisations accessible to the visually impaired?

It’s estimated that about 10% of the global population lives with a disability of some kind, totalling approximately 650 million people. Beyond that, about 2.2 billion people globally have some form of blindness or visual impairment - a staggering prevalence of roughly one in four people.

Though data journalists are often working on a short timeline, excluding a quarter of the population from being able to easily and accurately read the news is a clear oversight and prevents people from being informed about the world around them. Luckily, many simple changes can be made to data visualisations and other multimedia news assets to address this disparity.

How does colour blindness impact data visualisation?

Colour blindness exists in many forms, including red-green colour blindness, blue-yellow colour blindness, and total colour blindness. Each of these conditions comes with its own set of challenges, but they all impact the way that a person sees the hue of a colour or the perceived difference in light wavelengths reflected by the colour. Due to this commonality, similar techniques can be employed to help accommodate all types of colour blindness.

This image shows two hue scales side by side. The scale on the left begins black at the top and fades through increasingly lighter shades of grey until it is white at the bottom. The scale on the right is a rainbow, beginning with red at the top and fading through orange, yellow, green, blue, indigo, and violet before returning to red.

Source: Illusions in Data Visualization - Rock Content

First, because colour blindness only impacts the way that hue is perceived, other aspects of colour can be manipulated instead in order to make the visualisation accessible to those who struggle to differentiate between hues. One such aspect is the brightness, or brilliance, of a colour. Brightness is the perceived difference in light amplitude reflected by the colour and can make colours appear more or less intense.

This image shows a rainbow of six colours and what those colours look like with the hue, or colour, taken out of them. It is difficult to tell the difference between green and blue, and between red and pink, in the grayscale version of the colours, demonstrating what a colour blind person might observe.

Source: Mixing Colours of Equal Luminance — Part 2 | by Colin Shanley | Design + Sketch | Medium

By ensuring that the colours in a data visualisation differ adequately in brightness, you can make it far easier for colour-blind readers to discern which marks in the visualisation correspond to which information. To test whether your colour scheme is sufficiently varied in brightness, use a colour blindness simulator like Coblis. Simply upload an image of your visualisation or colour scheme and then use the selection criteria at the top to see how that image would appear to a person with various forms of colour blindness. If the colours appear too similar, adjust your colour scheme to differentiate them more.

This image is a screenshot of Coblis showing what a photo of a pile of crayons looks like with full-colour vision and with one type of colour blindness.

Source: Coblis — Color Blindness Simulator
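A brightness check can also be scripted. The short Python sketch below computes the WCAG "relative luminance" of each colour in a palette and reports the contrast ratio between neighbouring entries; WCAG suggests at least 3:1 for graphical elements. The palette itself is just an example.

    # Check that palette colours differ in brightness, not just hue,
    # using the WCAG relative luminance and contrast ratio formulas.
    def relative_luminance(hex_colour: str) -> float:
        """WCAG relative luminance of an sRGB colour like '#1f77b4'."""
        def channel(c: int) -> float:
            c = c / 255
            return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        r, g, b = (int(hex_colour.lstrip("#")[i:i + 2], 16) for i in (0, 2, 4))
        return 0.2126 * channel(r) + 0.7152 * channel(g) + 0.0722 * channel(b)

    def contrast_ratio(c1: str, c2: str) -> float:
        l1, l2 = sorted((relative_luminance(c1), relative_luminance(c2)), reverse=True)
        return (l1 + 0.05) / (l2 + 0.05)

    if __name__ == "__main__":
        palette = ["#d7191c", "#fdae61", "#abd9e9", "#2c7bb6"]  # example palette
        for a, b in zip(palette, palette[1:]):
            print(f"{a} vs {b}: contrast {contrast_ratio(a, b):.2f}:1")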

In addition to independently checking colour schemes, some data visualisation tools have built-in functions for testing the accessibility of images for people with colour blindness. Datawrapper provides the ability to toggle between views demonstrating some types of colour blindness so that users can ensure their visualisation is accessible before publishing.

This image is a screenshot of the Datawrapper tool that can show users how their data visualisation looks to people with various forms of colour blindness.

Source: Datawrapper

Though colour is an excellent tool for those with full-colour vision, it obviously has many limitations. To further increase the accessibility of a data visualisation, experiment with visual cues other than colour in order to differentiate categories of information. Utilising patterns and textures is a fantastic way to demarcate a range of data when colour alone isn’t sufficient. Adding dots, stripes, hash marks, and other shapes avoids the issue of colour similarity entirely.

This image shows a line chart that uses both texture and colour to indicate which line corresponds to which function.

Source: Two Simple Steps to Create Colorblind-Friendly Data Visualizations | by CR Ferreira
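In matplotlib, for instance, textures can be added with hatch patterns. The sketch below is a minimal example with made-up data: each bar keeps its colour but also gets a distinct hatch, so the categories remain distinguishable even if the colours collapse together for a colour-blind reader.

    # Minimal matplotlib example: texture (hatching) plus colour.
    import matplotlib.pyplot as plt

    categories = ["Paper", "Glass", "Plastic", "Metal"]
    tonnes = [120, 85, 60, 40]          # illustrative data
    hatches = ["//", "..", "xx", "--"]  # one distinct texture per category

    fig, ax = plt.subplots()
    bars = ax.bar(categories, tonnes,
                  color=["#66c2a5", "#fc8d62", "#8da0cb", "#e78ac3"],
                  edgecolor="black")
    for bar, hatch in zip(bars, hatches):
        bar.set_hatch(hatch)
    ax.set_ylabel("Tonnes recycled")
    ax.set_title("Recycling by material (illustrative data)")
    plt.savefig("hatched_bars.png", dpi=200)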

To take it a step further, separate all your data categories with a high-contrast border, ensuring that no two colours are placed directly against each other. This circumvents the issue of simultaneous contrast, which is when two colours placed directly next to each other change the way the hue and brightness of those two colours are perceived by the human eye.

This image shows an illusion that demonstrates how simultaneous contrast changes the way colours are perceived. All reds are the same shade of red, and all greens are the same shade of green throughout the image, despite appearing different.

Source: What is simultaneous contrast

How can completely blind people access data visualisations?

To be inclusive of those who are completely blind, it’s essential to provide non-visual translations of data visualisations. The most straightforward tool for beginning this process is alt text. “Alt text” stands for “alternative text” and is a brief description of the image that can be read by a user, either visually or with a screen reader, if the image cannot be viewed.

This image shows a screenshot of the alt text for the image in a tweet by the Data Visualization Society.

Source: https://twitter.com/DataVizSociety/status/1594737490589515785

There are advantages to alt text beyond physical accessibility, including a boost in SEO value and the ability for people with slow internet connections to know what an image contains even if the full file will not render. In general, it is best to keep alt text succinct and descriptive, providing just enough information to substitute for the image. Avoid repeating words, and be sure to include a reference to any text in the image. For data visualisations, the title of the visualisation, along with a mention of the form of the image (bar chart, map, etc.), is a great place to begin.
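That formula is simple enough to script. The Python sketch below assembles alt text from a chart type, a title and one takeaway, and drops it into an HTML img tag; the helper function and its fields are illustrative, not any standard API.

    # Illustrative helper: build alt text as "chart type + title + takeaway".
    def build_alt_text(chart_type: str, title: str, takeaway: str) -> str:
        """Combine the pieces into one succinct description for an alt attribute."""
        return f"{chart_type} titled '{title}'. {takeaway}"

    alt = build_alt_text(
        chart_type="Bar chart",
        title="Recycling by material",
        takeaway="Paper is recycled at roughly three times the rate of metal.",
    )
    # Embedding it in an HTML image tag:
    print(f'<img src="hatched_bars.png" alt="{alt}">')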

People who access information with a screen reader not only rely on alt text to describe images to them but also often navigate web pages with a keyboard or other button-based device instead of using a mouse or trackpad. Though this is not a problem for static data visualisations, it often becomes complicated when working with interactive media.

The main culprit of this difficulty is the hover function. Though showing a tooltip, revealing a factoid, or providing other information when the mouse hovers over a particular spot is visually striking, there is no equivalent action for a person exclusively using a keyboard. As a result, those who use screen readers are almost always unable to access these functions.

This image shows a bar chart from a story by the Pudding about who is represented on banknotes. When the mouse hovers over a dot on the bar, more information is given in a tooltip about who that dot represents.

Source: Who’s in Your Wallet?

Instead of using a hover function, allow users to click on objects in interactive visualisations in order to access information. Alternatively, think about whether the information could be integrated or displayed without action from the user. Lastly, always question whether the information needs to be directly in the data visualisation or if it could be provided in an accompanying table, caption, or additional visualisation.

In general, the more places where descriptions of data visualisations can be provided, the better. When feasible, include longer descriptions of images and data visualisations in captions or directly in the copy of the story. These locations allow the journalist to have more freedom with the detail of the description than can necessarily be achieved with alt text alone and are more easily read by a screen reader.

Although screen readers are a fantastic tool that allow blind and visually impaired people to access most web content, having an audio version of a story is an even better format. The audio version should not only contain a reading of the article but also include spoken descriptions of any accompanying images, data visualisations, and multimedia. For an example, check out the audio recording of this article.

Finally, as the world of data visualisation has become more expansive, data sonification has entered the field. Data sonification is exactly what it sounds like: using sound to demonstrate patterns or stories in data. One great example is this sonification of earthquakes in Oklahoma, made by Reveal.
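As a toy illustration of the idea, the Python sketch below maps each value in a small, made-up series to a pitch and writes the result to a WAV file using only the standard library, so a listener can literally hear the trend rise and fall.

    # Toy data sonification: one short tone per data point, standard library only.
    import math
    import struct
    import wave

    SAMPLE_RATE = 44100
    values = [3, 5, 8, 13, 9, 21, 34, 28]  # illustrative data series

    def tone(frequency_hz: float, seconds: float = 0.25) -> bytes:
        """Generate a 16-bit mono sine tone at the given frequency."""
        n = int(SAMPLE_RATE * seconds)
        return b"".join(
            struct.pack("<h", int(30000 * math.sin(2 * math.pi * frequency_hz * i / SAMPLE_RATE)))
            for i in range(n)
        )

    lo, hi = min(values), max(values)
    with wave.open("sonification.wav", "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        for v in values:
            # Scale each value to a pitch between 220 Hz (A3) and 880 Hz (A5).
            pitch = 220 + (v - lo) / (hi - lo) * 660
            wav.writeframes(tone(pitch))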

How do you make data visualisations more accessible for everyone?

Beyond providing accessible content to those with physical disabilities, journalists should always strive to make their content digestible for everyday readers. Though it can be difficult for the creator to catch when a visualisation is difficult to read or confusing for the average user, there are a number of design aspects that journalists and editors can consciously consider to help determine whether the data visualisation needs to be revised.

How do font sizes impact reader engagement with data visualisations?

In the age of technology, font sizes may seem irrelevant when zooming in on a web page is an option. However, data visualisations are most effective when the viewer can see the entire image at the same time. Zooming in and out on titles, data labels, and other text can interrupt the viewing experience and potentially cause readers to simply scroll past.

When possible, create visualisations with large enough text to read easily on both computer screens and smartphones. Mobile browsing makes up more than half of all internet usage, so providing data visualisations that are smartphone-compatible is essential to content accessibility. If creating a mobile version of the visualisation is impossible for any reason, make sure to note alongside the image or interactive component that the data visualisation can only be fully viewed on a desktop or laptop computer.

Though requiring users to zoom in on images should be avoided where possible, always ensure that data visualisation files are posted as medium- to high-quality images for web formats. This gives users the ability to make the image as large as they need in the cases where zooming becomes necessary.

How can data visualisations be more visually cohesive?

Each data visualisation contains its own visual language, full of colours, patterns, fonts, and more. In order for users to properly read and digest the data visualisation, the visual language used must be cohesive and transparent.

One way to create visual cohesion is to connect colours and patterns to their intuitive meaning. For example, if a journalist was creating a graphic about recycling, they would likely want to use a primarily green colour scheme, as green is already associated with recycling and other efforts to help the natural environment. Similarly, in the instance that a practitioner wanted to create a visualisation of the Great Pyramids in Egypt, they may want to integrate some triangular shapes into their design.

This image shows a bar graph of statewide recycling data in Pennsylvania. The graph utilises a green colour scheme, which readers already associate with recycling and other environment-related projects.

Source: Statewide Recycling Data

In addition to integrating intuitive meanings via colour and pattern, using variations of the same shape throughout a data visualisation can help readers identify the visual language of the image more easily. Depending on the story's topic, the place it will be published, and the tool used to create the visualisation, a data journalist might choose to use more curved lines, sharper angles, or a particular pattern that connects to the topic.

This image shows a data visualisation from Science that repeatedly uses the simplified form of a human to indicate that the research study was about the human genome.

Source: Haplotype-resolved diverse human genomes and integrated analysis of structural variation | Science

With regard to both colour and shape, utilising the power of contrast can help readers understand the relationship between various parts of the data. If two conditions are at odds with each other, like areas with severe drought versus severe flooding, then a contrasting colour scheme will help the reader understand the opposite relationship.

This image shows a world map with each country coloured in a contrasting colour palette depending on its level of hazardous waste generation. Green indicates a lower volume of hazardous waste, and red indicates a higher volume.

Source: Hazardous Waste Management: An African Overview

Outside of colour and shape, fonts play a significant role in the perceived tone of a data visualisation. A chart with its title written in Comic Sans will most likely be perceived as rudimentary or childish. On the other hand, one that uses Futura may be seen as modern, and one that uses Times New Roman might be considered academic or authoritative. Be sure that the fonts in data visualisations match the topic and intended tone of the image so that readers are poised to digest the information.

This image is an infographic detailing the different tones set using slab serif, sans serif, serif, and modern serif fonts. Each font comes with its own personality, from attention-grabbing to timeless.

Source: Font Psychology: Here's Everything You Need to Know About Fonts - Designmodo

How can data practitioners better guide reader interactions with data visualisations?

Even when data journalists do their best to make the most visually pleasing and cohesive graphics possible, a nearly foolproof method for getting readers to understand a data visualisation is simply writing out instructions. It may seem tedious, but this tactic can be especially useful for more unusual data visualisation formats and complex interactive graphics.

The most basic form of directions in data visualisations is data labels. It is best practice to directly label data points when possible instead of using legends or keys. This technique allows users to see the data label and the data simultaneously instead of forcing them to look at the chart, then at the legend, and back at the chart again. It is also a more compatible format for screen readers in the case of interactive visualisations.

This image shows two versions of the same line graph. The version on the left uses a legend, and the version on the right uses direct labelling, demonstrating the increased cohesiveness that direct labelling provides.

Source: Directly Labeling Your Line Graphs | Depict Data Studio
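In matplotlib, direct labelling can be done by annotating each line at its final data point, as in the minimal sketch below (the series are invented for illustration).

    # Minimal matplotlib example: direct labels at line ends instead of a legend.
    import matplotlib.pyplot as plt

    years = [2018, 2019, 2020, 2021, 2022]
    series = {"Solar": [10, 14, 19, 27, 36], "Wind": [22, 25, 28, 30, 33]}

    fig, ax = plt.subplots()
    for name, values in series.items():
        (line,) = ax.plot(years, values)
        # Place the label just past the last point, in the line's own colour.
        ax.annotate(name, xy=(years[-1], values[-1]),
                    xytext=(5, 0), textcoords="offset points",
                    va="center", color=line.get_color())
    ax.set_xlim(years[0], years[-1] + 1)  # leave room for the labels
    ax.set_title("Capacity by source (illustrative data)")
    plt.savefig("direct_labels.png", dpi=200)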

The next step up from data labels is data visualisation captions. As discussed earlier, providing a description of the data visualisation in a caption is helpful for those using screen readers. However, captions also help visually-able users more easily digest the data. Captions might contain the name of the data visualisation format (especially if it’s unusual), a note about where to look first, or a primary trend represented in the image.

In cases where proper data labels and captions still don’t feel like quite enough information to guide the reader, more extensive instructions can be provided in the form of one or more paragraphs of introductory text before the reader is asked to engage with the visualisation. This would likely only be necessary in the case of a very complex visualisation or interactive web app. If a web app needs to provide users with even more guidance, consider developing some introductory pop-ups to let the reader know what different aspects of the visualisation are intended to do.

This image shows the opening page of an interactive story by the Pudding that has explicit pop-up instructions telling the reader how to navigate the content.

Source: Tracking Heat Records in 400 U.S. Cities

Conclusions

As with all data visualisation methods, accommodating as many users as possible and creating more accessible products is a constant learning experience. Whenever possible, provide as many accommodations as is reasonable for the project and strive to integrate any feedback given by readers moving forward.

To help you create more accessible data visualisations that serve the public, the Lessons learned and Recommended resources sections below provide a short checklist of tips to reference day-to-day and a swath of additional reading materials to expand your knowledge about accessible data visualisation techniques.

Lessons learned

  • Why is accessible data visualisation important? Data journalists provide insight into the data stories that exist in the world for everyday people. Because those data stories impact everyone, regardless of their ability to process data visualisations, working to create more accessible data visualisation expands the potential audience for a story, ultimately making that story more useful.

  • How do you make data visualisation accessible? Data visualisation can be made more accessible both by accommodating physical disabilities, like blindness, and by tailoring the design to be intuitive for readers based on how humans interact with visual information.

  • How do you make data visualisations accessible for someone who is visually impaired? Ensuring that your visualisation includes contrasting colours and visual indicators like patterns and textures can make your work more accessible for people who are colour-blind. In the case of a person with total blindness, accommodate screen readers by providing informative alt text, including written descriptions of the visualisation and its trends, and avoiding hover functions in interactive visualisations.

  • What are three ways to make data visualisations more accessible for everyone? To make data visualisations more reader-friendly even for those with fully-functioning eyesight, take care to make your font sizes easily readable on all screen sizes, promote visual cohesion by connecting colours and shapes in your visualisation to preexisting connotations, and provide explicitly written instructions for how to read the visualisation when necessary.

Recommended resources

Below are lists of various resources that are helpful when learning more about data visualisation accessibility. These lists are provided in addition to the resources linked throughout the above article.

General data visualisation accessibility resources:

Colour resources:

Alt text resources:

Data sonification resources:

Mobile adaptation resources:

Typography resources:

]]>
Uncovering the truth: Exploring the benefits of federated databases for policing records https://datajournalism.com/read/longreads/database-police-records-accountability Tue, 06 Dec 2022 10:00:00 +0100 Cheryl Phillips https://datajournalism.com/read/longreads/database-police-records-accountability In the past four years, officers in Bakersfield, California have broken 31 bones. In every one of those cases, the officers involved received no discipline. This startling finding was only uncovered because of an unprecedented cross-domain collaboration to make police records transparent. In Washington, D.C., The Washington Post partnered with the Investigative Reporting Program at the University of California-Berkeley, publishing The Unseen Toll of Nonfatal Police Shootings and building on its already impressive Fatal Force database.

And across the United States, some 20 different news organizations and journalists are working with Big Local News out of Stanford University to collect police decertification records, building up a repository that will help anyone trying to track problem officers.

Such collaborative data journalism efforts are international in scope as well. In Brazil, for example, a small team of reporters from the news organization Ponte partnered with Marcelo Soares, a leading data journalist, to expand their abilities and the scope of their work into policing issues.

“Ponte has a capable, diverse and small team of human rights-minded reporters, who hit hard and take no guff,” Soares wrote in an email. “What they didn't have was too much experience handling data analysis.”

“A story they could publish after literally our second class was this one, "Deaths Without Color". They used São Paulo data on police killings and showed how the police gradually stopped recording the ethnicity of many people they killed,” Soares added. “It grows month by month after the deaths of George Floyd in the U.S. and Beto Freitas in the south of Brazil.”

Now, this type of work is expanding in a new way, moving from a web of journalists working together to include others outside journalism: data scientists, advocates, and criminal defense lawyers. In California, the Community Law Enforcement Accountability Network is building a model for how public defenders, community advocates, data scientists and investigative journalists can work together to uncover and expose police records.

Even with cross-domain collaboration, the challenges are huge. But without such partnerships, the task would be impossible, said Barry Scheck, the founder of The Innocence Project, a nonprofit organization that works to free those wrongfully convicted.

“Working across domains is, of course, essential because that’s the only way we are going to accurately and reliably identify and gather and share the data. So, you are best able to do that when you are working across domains and with different groups that are all engaged in the same enterprise,” Scheck said. “The second part that I’ve really come to appreciate is the critical function that civil society plays when trying to perform oversight of the policing function. It [policing] really is embedded in secrecy and has been, and I don’t see any other way to effectively get this out into the open.”

So far, the effort has obtained 165,000 records, all through public records requests to around 700 agencies across the state of California, said Lisa Pickoff-White, a visiting senior data journalist at Big Local News and a data journalist at KQED. Entering just one complete case can take anywhere from one to four hours.

But working across disciplines may soon pay off. Data scientists at Berkeley are helping to build out tools that will make extracting key facts easier, a human-in-the-loop machine learning system. Meanwhile, journalism interns at Stanford and the Investigative Reporting Program at Berkeley hand-enter and verify data from the cases and then help report out stories. That’s how the Bakersfield story on police breaking bones was produced. The goal: keep scaling up and publishing critical accountability journalism along the way, all done with collaborative partners, from lawyers to data scientists.

The roots of a policing collaboration

In 2014, during my first year teaching data journalism at Stanford University, I gave my students a public records assignment: request police stop data from state patrol agencies across the United States. A few months and millions of traffic stops later, I started a conversation with Sharad Goel, an engineering professor with expertise in criminal justice and racial bias (now at Harvard University). Within the year, we launched the interdisciplinary Stanford Open Policing Project. In the years since, we have advised police agencies, trained hundreds of journalists on how to analyze the policing data for their own stories and seen law and policy changes from cities like Nashville to the state of California.

The Open Policing Project spurred the idea of another collaborative effort, Big Local News, a data-sharing platform that makes data-driven accountability journalism easier to achieve for local newsrooms.

Around the same time, a new California law spurred another collaboration. “Lawmakers passed the landmark “Right to Know Act” in 2018, chipping away at a four-decade wall of secrecy concerning police internal investigations and officer discipline in California,” according to what is now known as the California Reporting Project. “Six founding organizations joined together to seek the transparency that SB 1421 promised.”

Those initial six news organizations grew to 40, and they joined forces to sue when necessary and to work together on stories they couldn’t have done alone. In 2019, the collaboration morphed again, spurred by Scheck's interest. Some 70 people, data scientists, journalists, lawyers, advocates and more, came together at Stanford that fall to discuss how best to continue the effort toward police transparency.

Already, the journalists were facing daunting challenges, from negotiating for records to technical issues with working with the records once obtained.

The solution: work across domains to build an infrastructure that will support a variety of needs and that will foster that same transparency in police accountability. The new effort: The Community Law Enforcement Accountability Network (CLEAN). The partners include Big Local News, the California Reporting Project and its newsroom partners; criminal justice advocacy groups, such as the ACLU, the National Association of Criminal Defense Lawyers, the Innocence Project; and data scientists from the University of California at Berkeley.


The negotiation

Now, every week, Big Local News journalist Phoebe Barghouty sends emails and has phone call after phone call with local government clerks and information request managers about the status of requests for the California Reporting Project. Sometimes she leads student trainings on how to file requests. On other days, she confers with lawyers on which cases may end up as part of a lawsuit challenging denials by a police agency. Just this October, KQED, one of the California Reporting Project partners, sued to obtain critical records from the California Department of Corrections.

Barghouty works with journalism students at Stanford and Berkeley, along with other partners, to keep the public records moving along. (Those same students work on stories too.)

Finally, Barghouty helps import and organize gigabytes of documents, audio and video files. Every week, she meets with Pickoff-White to chart the next steps.

“We are possibly one of the largest projects that’s doing requests at scale,” said Pickoff-White. Just the act of requesting police use-of-force and misconduct cases is daunting. The journalists work with MuckRock, the nonprofit public records organization and yet another partner in the sprawling collaborative effort, to stay organized on records requests. Once documents come in, the journalists and students upload all those records to DocumentCloud. From there, the partners begin the process of hand-entering dozens of fields of information into a database. And the information in that database is analyzed for possible news stories, one jurisdiction at a time.

“We’ve been able to request so much but we’re still figuring out how to process it all,” said Pickoff-White.

From documents into data

Figuring out “how to process it all” is where the Berkeley Institute for Data Science comes into the picture. Led by Nobel Laureate Saul Perlmutter, BIDS is partnering with journalists and lawyers alike to provide the tools to move beyond manual data entry. Professors Joe Hellerstein and Sarah Chasins of the Department of Electrical Engineering and Computer Sciences (EECS) and Aditya Parameswaran of the School of Information and EECS are developing new methodologies in AI, databases, human-computer interaction, and visualization to enable the work. They now routinely meet with data and investigative journalists to sort through the daunting technical challenges. Some of their PhD students worked with investigative reporting students on clustering algorithms, for example, to better identify themes in the policing documents.
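As a rough sketch of the kind of clustering involved - not the CLEAN team's actual pipeline - the Python example below vectorises a handful of invented document snippets with TF-IDF, groups them with k-means, and prints the most characteristic terms in each cluster.

    # Theme-finding sketch with scikit-learn: TF-IDF vectors + k-means.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [  # illustrative snippets standing in for case files
        "officer used baton during arrest causing fracture",
        "complaint alleges excessive force during traffic stop",
        "internal review of discharge of firearm by deputy",
        "use of force report canine bite during pursuit",
        "firearm discharged during vehicle pursuit review",
        "broken bone reported after baton strike",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(documents)

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(matrix)
    terms = vectorizer.get_feature_names_out()
    for cluster_id, centroid in enumerate(kmeans.cluster_centers_):
        top = centroid.argsort()[::-1][:4]  # four highest-weighted terms
        print(f"cluster {cluster_id}:", ", ".join(terms[i] for i in top))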

The work by BIDS with the effort started with a conversation between Scheck and Perlmutter, not long after George Floyd was killed by police officers in Minneapolis in May 2020.

Collaboration with scientists has already become part of the model for data journalism.

In August, The Places Project reported on its efforts to build a “Collaborative Platform for Societal Issues.” “In order to experiment with a renewed collaboration between researchers and journalists, the PLACES project wanted to make them co-actors in the process, inviting them to take part in a citizen science approach,” an executive summary of the report said.

And an article on using satellite imagery in journalism, published in the Online Journalism Blog, noted that such work almost always involves a collaboration.

“Most journalistic pieces that use AI and satellite imagery are collaborative projects and rely on a data expert,” wrote the article’s author, Federico Acosta Rainis.

The CLEAN leaders hope this work will be an example for what can be done elsewhere in criminal justice reporting. The effort followed up its initial convening in 2019 with another daylong meeting in December 2021 (with a hiatus in between due to the pandemic).

“It will be a prototype in terms of a model of cooperation,” Scheck said at the beginning of that 2021 meeting. “That stakeholders in the criminal legal space -- we have with us today progressive prosecutors, inspector generals, a lot of reporters, public defenders, civil rights lawyers, you name it -- can all get together and in an efficient, appropriate way, share information about law enforcement misconduct that we have never known about before.”

On the data end of the project, the data scientists are working to build what they call a federated database, one that will make it easy to both protect private data and share public data among the partners. The second component is to build tools to make it easier to access and analyze the data, said Perlmutter in the 2021 meeting.
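The project's actual schema isn't spelled out here, but one common pattern for that public/private split is a database view that exposes only the shareable fields while sensitive columns stay behind access controls. A minimal SQLite sketch with invented fields:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE incidents (
    id INTEGER PRIMARY KEY,
    agency TEXT,
    incident_type TEXT,
    complainant_name TEXT  -- private: stays with the originating partner
);
CREATE VIEW incidents_shared AS
    SELECT id, agency, incident_type FROM incidents;  -- public fields only
""")
conn.execute(
    "INSERT INTO incidents VALUES (1, 'Example PD', 'use of force', 'J. Doe')"
)
print(conn.execute("SELECT * FROM incidents_shared").fetchall())
# [(1, 'Example PD', 'use of force')] -- no name crosses the boundary
```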

“The conversation is now very up-front and center about how best to help society and do fundamental research with data science,” he said before ticking off the ability to use newer machine-learning methods, other AI systems, cloud computing and more. And seeing disparate organizations work together is exciting, he added. “It makes you feel that we have the capability to solve the deep problems that we’re seeing.”

The push by 20 news organizations to collect police certification data is expected to use some of the same technologies being developed by CLEAN, and the certification project is already using a data processing pipeline developed by Big Local News. That effort, like many of these new collaborations, builds on previous work: John Kelly, now a data editor at ABC-owned stations in the United States, first approached this growing, informal network of news organizations eager to work together because he had already developed an earlier version of a certification database for a project at USA TODAY. Moving the scope beyond one organization makes it easier to scale a project in a way that benefits more newsrooms. And that benefits the public more too.

“Local newsrooms don’t always have the time or the technological skill or resources to go through these records on their own,” said Pickoff-White. She added that the next step will be to make at least some of the data more widely available. “We’re hoping to allow members to more easily sort and sift and analyze records about policing in California and eventually disseminate this information to the public.”

That Bakersfield story about police breaking bones with baton strikes was made possible because of the involvement of more than a dozen students from Stanford and Berkeley who “painstakingly” entered all the data, noted David Barstow, the head of investigative reporting at the UC Berkeley Graduate School of Journalism, at the 2021 convening. “That was something that only emerged from this systematic examination of every single use of force in Bakersfield.”

As the tools built by BIDS continue to be used, the available data will grow, and so will the possibilities.

The increased transparency that comes from such efforts may mean that policing itself will change, Scheck said. “They (police) should be subject to the same transparency as any other public servant,” he said. “And that’s what’s at the heart of this struggle right now.”

Where do collaborations end?

Back in Bakersfield, a few weeks after the stories about police breaking bones, student reporters were back at it. For their second story, as part of the CLEAN project, they focused on mental health issues. Their work was published by local radio stations, KQED in San Francisco and The Associated Press. And after that, the reporters didn’t just leave. Instead, they took the collaborative ethos to a new level.

Creating a system that will include disseminating policing data to the public means understanding what the public needs and wants. And that means, you guessed it, collaboration. Students and journalists met with community groups, surveying them about what they wanted out of the new police transparency effort. Now, that information is being used to help develop the searchable interface available to the public.

Lessons learned

  • Build in time to turn messy police records into structured data. You can use the power of a curated crowd, those collaboration partners, to make the lift of turning messy documents into valuable data achievable. At the same time, you can produce related stories as you go. You will find nuggets of newsworthiness as you collect and read through the documents you obtain. Report on those. You can use them at the end of a big project or publish as you go. Either way, you are building forward momentum.

  • Collaborations can cross domains. Work across spheres with lawyers, data scientists, engineers and more to achieve your goals. Some projects are just too massive to achieve otherwise. But set up working agreements from the start. At a minimum, it ensures you have had important conversations from the outset.

  • Use other data to add vital context.

They're not database rows; they're people https://datajournalism.com/read/longreads/newsrooms-personal-information-policy Thu, 01 Dec 2022 09:56:00 +0100 Thomas Wilburn https://datajournalism.com/read/longreads/newsrooms-personal-information-policy Over the last decade, one of the goals of data journalism has been to increase accountability and transparency through the release of raw data. Admonitions of “show your work” have become common enough that academics judge our work by the datasets we link to. These goals were admirable and (in the context of legitimizing data teams within legacy organizations) even necessary at the time. But in an age of 8chan, Gamergate, and the rise of violent white nationalism, it may be time to add nuance to our approach.

For this discussion, I'm primarily concerned with the publication of personal data (also known as personally-identifiable information, or PII). In other words, we’re talking about names, addresses or contact info, lat/long coordinates and other geodata, ID numbers (including license plates or other government IDs), and other data points that can be traced back to a single individual.

Much of this is already public record, but that’s no excuse: as the NYT Editorial Board wrote in 2018, “just because information is public doesn’t mean it has to be so easy for so many people to get.” It is irresponsible to amplify information without thinking about what we’re amplifying and why.


Moreover, the idea that journalists could contribute to personal data leaks isn't theoretical: many newsroom projects start with large-scale FOIA dumps or public databases, which may include exactly this personal data. There have been movements in recent years to monetize these databases--creating a queryable database of government salaries, for example, and offering it via a subscription or using it as a source of reliable traffic from rubberneckers. Even random public records requests may disclose personal data. Intentionally or not, we’re swimming in this stuff and have become jaded as to its prevalence. Is it right for us to simply push it out without re-examining the implications of doing so?

The Texas Tribune's salary database includes the names and pay of all Texas public servants making more than the median salary for state workers.

I would stress that I’m not the only person who has thought about these things, and there are a few signs that we as an industry are beginning to formalize our thought process in the same way that we have standards around traditional reporting:

  • The Markup’s ethics policy contains guidelines on personal data, including a requirement to set an expiration date (after which point it is deleted).

  • Reveal’s ethics guide doesn’t contain specific data guidelines but does call out the need to protect individual privacy: “Recognize that private people have a greater right to control information about themselves than do public officials and others who seek power, influence or attention. Only an overriding public need can justify intrusion into anyone’s privacy.”

  • The AP no longer reports the names or runs stories on mugshots for minor crimes.

  • The New York Times ran a session at NICAR 2019 on “doxxing yourself,” in part to raise awareness of how vulnerable reporters (and by extension, readers) may be to targeted harassment and tracking.

  • A 2016 SRCCON session on “You’re The Reason My Name Is On Google: The Ethics Of Publishing Public Data” explored real-world lessons from the Texas Tribune’s salary databases (transcript here).

  • Poynter wrote about the conflicts and difficulties that journalists have when publishing personal data all the way back in 2013.

A data-rich environment is dangerous

In her landmark 2015 book The Internet of Garbage, Sarah Jeong sets aside an entire chapter just for harassment. And with good reason: the Internet has enabled new innovations for old prejudices, including SWATting, doxing, and targeted threats at a new kind of scale. Writing about Gamergate, she notes that the action of its instigator, Eron Gjoni, “was both complicated and simple, old and new. He had managed to crowdsource domestic abuse.”

More recently, until it was driven off of its CDN provider, the Kiwi Farms forum served as a home base for digital bullying, as posters there would pick vulnerable targets (especially those who were LGBTQ), indiscriminately collect information about them by scouring different web sources, and then attempt to hound them into suicide or retreat from public life. KF was not known for being particularly good at gathering information, but they didn't need to be: accuracy is not the point of a harassment campaign, and collateral damage was something it was happy to encourage.

I'm focusing on harassment here because I think it provides an easy touchstone for the potential dangers of publishing personal information. Since Latanya Sweeney’s initial work on de-anonymizing data, an entire industry has grown up around taking disparate pieces of information, both public and private, and matching them against each other to create alarmingly-detailed profiles of individual people. This is the foundation of the business model for Facebook, as well as a broad swathe of other technology companies. This information includes your location over time. And it’s available for purchase, relatively cheaply, by anyone who wants to target you or your family. Should we contribute, even in a minor way, to that ecosystem?

These may seem like distant or abstract risks, but that may be because, for many of us, this harassment is more distant or abstract than it is for others. A survey of “news nerds” in 2017 found that more than half are male, and three-quarters are white (a demographic that includes myself). As a result of this background, many newsrooms have a serious blind spot when it comes to understanding how their work may be seen (or used against) underrepresented populations.

In particular, as rhetoric has ramped up over the last decade, it's become clear that newsrooms are not listening to the few trans journalists in their ranks. When the US "paper of record" fights back against updating historical bylines that contain their own reporters' deadnames, it sends a clear message about whose data matters, whose doesn't, and how seriously the institution takes the threat of personal metadata.

We are very bad as an industry at thinking about how our power to amplify and focus attention is used. Even if harassment is not the ultimate result, publishing personal data may be seen by our audience as creepy or intrusive. At a time when we are concerned with trust in media, and when that trust is under attack from the top levels of government, more care is necessary.

Names and shame

Ultimately, I think it is useful to consider our twin relationship to power and shame. Although we don’t often think of it this way, the latter is often a powerful tool in our investigative reporting. After all, as the fourth estate, we do not have the ability to prosecute crimes or create legislation. What we can do is highlight the contrast between the world as we want it to be and as it actually is, and that gulf is expressed through shame.

The difference between tabloid reporting and “legitimate” journalism is the direction that shame is directed. The latter targets its shame toward the powerful, while the former is as likely to shame the powerless. In terms of accountability, it orients our power against the system, not toward individual people. It’s the difference between reporting on welfare recipients buying marijuana, as opposed to looking at how marijuana licensing perpetuates historical inequalities from the drug war.

Our audiences may not consciously understand the role that shame plays in our journalism, but they know it’s a part of the work. They know we don’t do investigations in order to hand out compliments and community service awards. When we choose to put the names of individuals next to our reporting, we may be doing it for a variety of good reasons (perhaps we worked hard for that data, or sued to get it) but we should be aware that it is often seen as an implication of guilt on the part of the people within.

This NPR database publishes the names and former addresses of Americans who took FEMA buyouts in disaster zones. The people named (in a fully downloadable database) are not guilty of crimes, or culpable for the mismanagement of the program itself.

In the small Virginia county where I went to high school, the local right-wing newspaper would publish the salaries of every teacher in the local public school system. There was no explicit threat of violence, but it was meant to feel invasive and hostile, and it did. When I worked at the Seattle Times and had conversations with editors about potentially creating a salary database for Washington State, it was hard to capture the difference between what we were doing, and what that Virginia paper had attempted to do. For the people named in those kinds of databases, it probably doesn't feel like there's really a difference at all.

Toward a philosophy of PII in reporting

I want to be very clear that I am only talking about the public release of data when I ask for increased caution. I am not arguing that we should not submit FOIA or public records requests for personal data or that it can’t be useful for reporting. I’m also not arguing that we should not distribute this data at all, in aggregated form, on request, or through inter-organizational channels. It is important for us to show our work and to provide transparency. I’m simply arguing that we don’t always need to release raw data containing personal information directly to the public.

In the spirit of Maciej Ceglowski’s Haunted by Data, I’d like to propose we think of personal data in three escalating levels of caution:

  • Don’t collect it!

When creating our own datasets, it may be best to avoid personal data in the first place. Remember, you don’t have to think about the implications of the GDPR or data leaks if you never have that information. When designing forms for story call-outs, try to find ways to automatically aggregate or avoid collecting information you will not use during reporting. I will note that this is often a tougher decision than it seems – consider, for example, the source diversity tracking that many newsrooms are now attempting to incorporate to diversify their coverage, which by extension often means gathering (and retaining) some degree of identifying data.

  • Don’t dump it!

If you have the raw data, don’t just throw it out into the public eye because you can. In general, we don’t work with raw data for reporting anyway: we work with aggregates or subsets because that’s where the best stories live. What’s the difference in policy effects between population groups? What department has the widest salary range in a city government? Where did a disaster cause the most damage? Releasing data in an aggregate form still allows end-users to check your work or perform follow-ups. And you can make the full dataset available if people reach out to you specifically over e-mail or secure channels (but you’ll be surprised how few actually do). Note that even aggregated or anonymized datasets may be vulnerable to so-called Database Reconstruction Attacks.
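Still, as a concrete illustration, here is a minimal pandas sketch (with invented rows) of releasing an aggregate instead of the raw file:

```python
import pandas as pd

# Invented raw rows containing PII.
raw = pd.DataFrame({
    "department": ["Parks", "Parks", "Police", "Police", "Police"],
    "name": ["A. Lee", "B. Cruz", "C. Diaz", "D. Kim", "E. Ford"],
    "salary": [52_000, 61_000, 74_000, 88_000, 99_000],
})

# Publish the aggregate, not the rows: readers can still check the math,
# but no individual appears in the released file.
summary = raw.groupby("department")["salary"].agg(["count", "median", "max"])
summary.to_csv("salaries_by_department.csv")
print(summary)
```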

  • Don’t leave it raw!

In cases where distributing individual rows of data is something you’re committed to doing, consider ways to protect the people inside the data by anonymizing it without removing its potential usefulness. One approach that I love from ProPublica Illinois’ parking ticket data is the use of one-way hash functions to create consistent (but anonymous) identifiers from license plates: the input always creates the same output, so you can still aggregate by a particular car, but you can’t turn that random-looking string of numbers and letters back into an actual license plate. As opposed to “cooking” the data, we can think of this as “seasoning” it, much as we would “salt” a hash function. A similar approach was used in the infosec community in 2016 to identify and confirm sexual abusers in public without actually posting their names (thus opening the victims up to retaliation).
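A minimal sketch of that seasoning approach using Python's standard library; the salt here is a placeholder, and a real one should be a long random secret kept out of version control (and perhaps discarded after publication, making the mapping permanently unrecoverable):

```python
import hashlib
import hmac

# Placeholder secret "salt" -- generate a long random value in practice.
SECRET_SALT = b"replace-with-a-long-random-secret"

def pseudonymize(plate: str) -> str:
    """Map a license plate to a stable but irreversible identifier."""
    digest = hmac.new(
        SECRET_SALT, plate.strip().upper().encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return digest[:12]  # shorter token; still effectively collision-free here

# The same plate always yields the same token, so you can still aggregate
# tickets by vehicle -- but the token can't be reversed back to a plate.
print(pseudonymize("ABC1234") == pseudonymize(" abc1234 "))  # True
```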

Policies, guidelines, and organizational support

The three "don'ts" above are rules that you can adopt for yourself or for a team that you lead inside a newsroom. They don't require institutional buy-in – they're like your code style or your team's best practice documents. But while ethics exist for ourselves, they are also historically a way that newsrooms establish trust and relationships with a community (see also: the original development of "objectivity" as a method of verifying factual truths, not as an abstention from political life). And that means having a public policy.

At NPR, sometime around 2019, I started the process of developing a set of public guidelines for the News Apps team. Unfortunately, I wasn't able to complete it before leaving the organization in 2021, due mostly to conflicts in scheduling and newsroom staffing availability. As a result, I can offer some guidance on what surfaced during this process so you may get further than I did.

First of all, this is a conversation with many stakeholders. You'll want to talk with your legal department or counsel about any issues that they can foresee (including which terminology may prove binding, such as "policy" vs "guidelines"). You'll also want to bring in your standards department or copy chief, whoever is in charge of normally making coverage decisions. Your goal as a data journalist isn't to be the final decision maker but to be able to help inform the decisions that data-shy editors may need to make.

Second, try to think about your process holistically. Your newsroom may already have a policy for redacting names from coverage or archives when there's a credible threat of violence or when circumstances have changed (say, a non-notable person is accused of a crime and later found innocent, but the original coverage still surfaces in searches for their name). I assume you're also already following (or are at least aware of) the Trans Journalists Association style guide in terms of policies on redacting or altering deadnames. Having a personal data policy is a great way to unify your organization's approach when covering trans communities, people in the criminal justice system, and other communities that are normally shy about being in the journalistic spotlight.

Third, try to think about how non-aggregate data is dangerous in combination and who has access to it. The ability to link names with addresses, geolocation or birth dates gives potential harassers more leeway to combine it with information obtained elsewhere. If it's possible to routinely delete or archive data in an inaccessible place at a preset time after publication (say, 90 days), you can lower the possibility of misuse by staff or inadvertent leaks. Internal systems, such as source diversity audits, should be designed so that reporters and staff cannot access the original data if it is retained.
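A minimal sketch of that preset-retention idea, assuming a scheduled job and a hypothetical directory of exported callout responses:

```python
import time
from pathlib import Path

RETENTION_DAYS = 90  # the preset window discussed above
cutoff = time.time() - RETENTION_DAYS * 24 * 60 * 60

# Hypothetical directory where callout exports land.
for f in Path("collected_pii").glob("*.csv"):
    if f.stat().st_mtime < cutoff:
        f.unlink()  # or move to an offline archive instead of deleting
```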

Finally, always consider how a public guideline or policy could be used by people acting in bad faith. For example, take into account those who will use the policy to try to make reporting on their actions more difficult, to manipulate the tone of your journalism or public figures who try to get coverage stricken. Spend time imagining how someone might try to abuse your rules, and then have conversations about how to respond ahead of time so that you're not trying to figure those situations out under pressure when – not if – it happens.

Lessons Learned

Once upon a time, this industry thought of computer-assisted reporting as a new kind of neutral standard: “precision” or “scientific” journalism. Yet as Catherine D’Ignazio and Lauren Klein point out in Data Feminism, CAR is not neutral, and neither is the way that the underlying data is collected, visualized, and distributed. Instead, like all journalism, it is affected by concerns of race, gender, sexual identity, class, and justice.

It is also, for better or worse, often an extractive process. Databases serve as the ultimate way to parachute into a community and make pronouncements about it, specifically because they do often feel so all-encompassing. On Chalkbeat's data team, we have tried to be conscious of the temptation to treat spreadsheets and public information as the story itself instead of relying on reporters who know and understand the locals' concerns, and can perform journalism with the community, not just on it. More importantly, we know that the way we publish today affects our ability to report within communities in the future, especially if they believe we're contributing to harassment, shaming, or abusive policy.

Incorporating an opinionated personal data strategy into our work gives data journalism a way to think about community-building and engagement. On a personal level, we can practice restraint in what we collect, dump, and publish in a raw form. As organizations, it's possible to create strong public commitments and policies on how we will handle identifying information for individuals.

Recommended resources

  • Book: The Internet of Garbage, by Sarah Jeong. As a comprehensive cross-section of how harassment, spam, and copyright collide on the Internet, it's hard to top Jeong's book, which details not only the complicated problems of each, but also how they bleed into each other and interact in complex ways.

  • Article: Taking care with source security when reporting on abortion, by Olivia Martin, Martin Shelton, and Jessica Bruder. It's worth remembering that we as journalists need to think not only about the bulk data we release, but also about how our reporting fits in the context of bulk data that could be used to identify, and even prosecute, our sources.

  • Article: How Data Journalists Can Use Anonymization to Protect Privacy, by Vojtech Sedlak. A good overview of techniques that you can use to season or de-identify a database, including helpful notes on how techniques can be broken or circumvented to re-identify subjects.

How data can power public health investigations — through collaboration https://datajournalism.com/read/longreads/how-data-can-power-public-health-investigations-through-collaboration Tue, 29 Nov 2022 10:19:00 +0100 Betsy Ladyzhets Dillon Bergin https://datajournalism.com/read/longreads/how-data-can-power-public-health-investigations-through-collaboration In the summer of 2021, the Documenting COVID-19 project published an article with The Kansas City Star about an elected coroner in Macon County, Missouri, who told us he routinely went against CDC guidance and wrote down causes of death that excluded COVID-19 if it “pleases the family.”

The story went viral and was picked up by multiple national outlets. After the story was published, a professor at the Boston University School of Public Health reached out to our team to say that he had been studying the potential undercount of COVID deaths across the country. He too was concerned that the anecdote from the Macon County coroner was part of a much larger problem, one that showed itself in his analysis of mortality data at the county level.

When our team first spoke with that professor, Andrew Stokes, he had almost as many questions for us as we did for him. His team had been working for months on a statistical model that led them to believe that something about the country’s death investigation system was resulting in gaps in the expected number of COVID deaths.

Over the next several months, we worked with Stokes’s Boston University team to find counties across the country whose trends in deaths during the pandemic raised concerns that a significant number of COVID deaths were being missed in official death tolls. At the same time, we began working with the USA TODAY network and local reporters, including those from hard-hit states like Missouri, Louisiana and Mississippi. In a follow-up project coming this year, we have continued to collaborate with Stokes and his team, along with local reporters from states with notable demographic disparities in COVID-19 deaths.

This reporting has required reporters and editors across seven newsrooms, a close working relationship with a team of demographers at Boston University, and feedback from several other experts throughout the process. Reporters asked questions and re-asked questions, assessed questions against the data and interviews, took findings back to experts, and started the process over again.

It took time and significant resources, but in the end, we were able to tell a story that we couldn’t have without collaboration. In this article, we’re going to share some of the things we learned during this project, from deciding if the data is the right foundation for a collaborative project, to working alongside academic researchers and reporters in other newsrooms.

Why this data made sense for collaboration

The COVID-19 pandemic forced an exceptional scenario in journalism: Every newsroom across the globe was working on the same story and every reporter needed the most robust, current data to understand that story. As governments failed to quickly provide this data in the early days of the pandemic, groups of journalists, scientists and citizens stepped in to provide it. These collaborations were not just a matter of preference but of necessity.

Numerous data-focused collaborations sprang up in the first year of the pandemic. They ranged from the well-known, hundreds-of-volunteers-strong COVID Tracking Project, housed at The Atlantic magazine and supported through foundation grants, to smaller projects monitoring COVID-19 cases in schools, state and local pandemic policies, contact tracing efforts and many other metrics.

Some initiatives, like the New York Times COVID-19 dashboard and the Bloomberg vaccination tracker, were housed entirely within journalism organizations and took advantage of existing infrastructure and resources. Others were decidedly more “bootstrapped,” running on simple spreadsheets or data visualization platforms.

For all of these initiatives, a similar pattern arose:

The more people you have collaborating on a dataset, the more capacity you build for catching errors, identifying nuance and communicating data findings to a wide audience. Without careful setup at the beginning of a collaboration, however, such communication can get unwieldy. Projects can languish with unclear goals and wonky data limitations, and important insights may not make it to the people who need them most.

The Documenting COVID-19 project, a collaborative open-records initiative sponsored by Columbia University's Brown Institute for Media Innovation and MuckRock, learned these lessons through its Uncounted investigation, which relied on a combination of data analysis at a national level and local reporting on the ground to reveal how short-staffed, undertrained and overworked coroners and medical examiners were nowhere near unified in investigating a possible death from COVID-19. As we prepare to publish a second national story from this project, we’re also pursuing forthcoming projects that will leverage public records to identify more detailed trends in the death investigation systems of specific states.

  • The far and the near

Across the world, scientists, journalists and state health organizations have estimated the “true toll” of the pandemic using a metric called “excess deaths.” Excess deaths are the number of deaths in a given time period that exceed what would be considered normal in any other year. In the case of COVID-19, researchers hypothesize that some of these excess deaths are, in fact, COVID-19 deaths that did not get counted correctly, or otherwise occurred because of the pandemic’s social and economic upheaval. This could include people who died during COVID-19 surges because they were unable to receive medical care for other conditions, for example.
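The arithmetic at the core of the metric is simple, even though real models (like the Boston University team's) adjust for population trends and seasonality. A toy sketch with invented numbers:

```python
# Excess deaths = observed deaths minus an expected baseline.
baseline = [780, 765, 790, 810, 805]   # invented: same month, five prior years
expected = sum(baseline) / len(baseline)

observed = 1_020                       # invented: same month during the pandemic
reported_covid = 130                   # invented: official COVID-19 deaths

excess = observed - expected
unexplained = excess - reported_covid  # deaths the official toll may be missing
print(f"expected ~{expected:.0f}, excess ~{excess:.0f}, "
      f"unexplained ~{unexplained:.0f}")
```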

While many people first heard the term “excess deaths” during the pandemic, the metric has a long history in public health research as a way to calculate the broadest possible impact of major health events. Researchers even use excess deaths to look back into history at events like the 1918 flu pandemic.

The Uncounted project focuses on excess deaths in the U.S., but these data are available from almost every country. The Economist, which tracks this metric globally, has compiled sources for 117 countries around the world.

As we investigated excess American deaths, our main question wasn’t just whether COVID deaths were being undercounted, but how. We also knew that any hypothesis to that question would have to confront the answer to how in different regions of the country. We needed to pair questions about the larger scheme with knowledge of the specifics of public health in a given county. We were searching for “the far and the near”, and that necessitated finding reporters who wanted to collaborate and knew what stones to turn so that our whole team could compare data to lived experiences in that area.

  • Death data originates at the local level

This approach turned out to be even more essential for this project because mortality data is shaped by how and where it is recorded. Similar to the overall public health system in this country, death investigations in the United States are a patchwork system. Some deaths are investigated with state-of-the-art technology and expertise, while others don’t go beyond a phone call with the family.

When someone dies in a hospital or health care facility, death investigations can be straightforward and are mostly standardized.

When a doctor isn’t present, a separate system often comes into play — the death investigation system of coroners and medical examiners. In the United States, the training, expertise and resources of a coroner or medical examiner in one county can be wildly different from the person investigating deaths in the county next door. This changes the quality of the data, and makes comparing data from different counties a complicated task.

Questions to consider

  • Does your data tell you about the far or the near? Are you the right person to explain those angles, or could you tell a fuller story with the help of someone who knows that angle more intimately?

  • Is there an expert that you could develop a mutually beneficial relationship with?

  • Does the story the data tells end with questions that someone else could answer?

  • Does the data connect different areas or issues?

A quick dive into the data behind mortality statistics

  • Where do you find data about death?

Our investigation started with data about death: who is dying and how. Most of the time, the public sees this data in big picture mortality statistics, like the percentage of people who died from cancer in 2022, or from homicides in 2021. Reporters are more interested in where this all starts. So, where do you find data beyond those big national statistics, about what people die of at a local level? The simple answer is the basic unit of this data: the death certificate. The more complicated answer is that, in the United States, how information reaches death certificates and how it is shared afterwards is highly variable.

Most states in the U.S. don’t consider death certificates public record, though the implications for public health and safety make death certificates the type of document that needs more sunlight.

  • The wonder of WONDER

In the United States, the next best way to get data about death at a local level is a query portal offered by the Centers for Disease Control and Prevention called WONDER, which stands for Wide-ranging Online Data for Epidemiologic Research. After information is recorded on a death certificate, the data is sent on to the CDC, where it is entered into WONDER’s provisional mortality database. For more details about WONDER, see our reporting recipe here.

Making sense of data in collaboration with experts

  • Taking advantage of different skillsets

When a data journalist reaches out to an expert, the process is typically rather one-sided. The journalist comes with endless questions and relies on the expert to guide them through analysis; the expert is primarily donating their time and skills in service of a resulting story.

Successful collaborations like Uncounted start from a different mindset, in which both journalists and experts seek to have their work elevated through partnership. While Stokes and his team provide us with novel data analysis, our reporting provides the demography team with new ideas for research and connects the academic work to people’s experiences.

For example: Stokes and his team have used the CDC’s mortality data to demonstrate that people who die at home are more likely to have their deaths attributed to nonspecific causes than those who die at the hospital. As reporters, we can talk to the coroners who fill out those certificates of people dying at home and learn about the resource limitations they face in determining specific causes of death.

  • Making time for thorough collaboration

In-depth collaboration takes much more time than a simple Q&A about methodology. At different points in reporting Uncounted stories, we have had weekly meetings with Stokes – along with hundreds of emails and Slack messages exchanged between teams. The meetings often include progress updates, discussing new findings, and figuring out the best way to visualize an important result in the final story.

Close collaborations with experts can be particularly valuable in the final stage of a story’s production, when data get fact-checked and headlines are determined. Experts can ensure that key statistics are represented accurately and answer last-minute questions from editors, though reporters might have to spend time translating from academic jargon to more accessible language.

Finding human stories behind the data in collaboration with other reporters

  • Balancing local- and national-level reporting

Different types of journalists bring different skills and contexts to a project, which can enhance a collaboration if organized carefully. With Uncounted and other large stories at the Documenting COVID-19 project, we’ve found it particularly valuable to pair up with local reporters: while we offer expertise in data, investigative, or beat reporting, local reporters bring invaluable knowledge of their communities.

As specialist reporters, we often take the lead on analyzing data and talking to experts, using the results to prepare memos; these memos provide key questions that our partners can pose to local sources. For example, an analysis of CDC mortality data for a particular county may lead us to an unusual finding about deaths from ill-defined or nonspecific causes. A local reporter can then question their county’s coroner or public health officials about the finding. “Is it possible some of these deaths were actually from COVID-19?” they might ask. After receiving a response, we can interpret it together.

  • Communicating frequently, remaining flexible

Regular meetings with journalism collaborators, like those with experts, are valuable for staying on the same page about story progress. These meetings might include anything from figuring out a story’s overall angle to nitty-gritty details like which types of charts work best in a partner’s CMS. We also check in with partners using email, phone calls or a shared Slack server, depending on their preference.

Early in the collaboration process, we make sure everyone involved in the project understands overall objectives and is comfortable with the planned timeline. Plans need to be flexible, though, especially when you’re working with local reporters who have limited bandwidth. Journalists managing daily news deadlines might need more time to complete an enterprise project – this is especially true for health reporters on call when new COVID-19 surges hit. Similarly, covering a challenging topic like the pandemic itself requires flexibility and empathy; reporters on either side of the partnership might need to take a step back due to burnout from years of covering this crisis.

Another crucial thing to discuss is the story’s editing process: which editors will take on which sections of the story, or which stages of the draft? Who will fact-check data points? Which publication’s style guide will you use when there are conflicts? Who will decide when the story is finally ready for publication? All these questions may sound overly tedious at the start of a project, but anticipating them early on can save headaches later.

Questions to consider

  • What skillsets do different members of the collaboration team bring to the table, and how might they be used for this project?

  • How much time could reporters on the collaboration team feasibly spend on this project, and how much time might it take to complete?

  • What timeline and story format (length, multimedia components, etc.) are feasible based on other deadlines and commitments?

  • Who is responsible for different aspects of reporting and editing? What is the hierarchy of edits, and who makes final decisions?

  • What audiences are served by the different partner outlets, and how can you ensure the final story meets their needs?

  • What are the team’s preferences for communication? (Email, Slack, regular meetings, all three?)

  • Sharing final stories with different audiences

The collaborative work doesn’t actually end when a story is published: coordinating with partners on sharing final stories can help your work reach the widest audience possible. We always make sure stories are published on MuckRock’s site on the same day as their publication on partner sites, and even try to coordinate with experts on releasing data findings in academic formats like preprints. Writing social media posts in advance, ensuring everyone has access to graphics, and tagging each other can all help with a unified marketing campaign for the story. And if you’re planning follow-ups, sharing reader responses across outlets can also be incredibly valuable.

The value of collaboration is worth the hassle

Working on collaborative journalism projects can mean endless hours in meetings and email threads, haggling over basic style choices or going over data points numerous times. But the effort is worth it to produce truly unique stories that couldn’t come from any one partner.

This is especially true for local newsrooms. In the U.S., many local publications are shrinking: few are able to dedicate time and resources to big investigative projects or to complicated beats like science and health. The Documenting COVID-19 project offers these newsrooms assistance with specialized reporting tasks; we help them produce high-impact enterprise stories while maintaining capacity for day-to-day news.

As we continue new projects in this collaborative model, we’re inspired by other organizations that work similarly, like the ProPublica Local Reporting Network, the nonprofit environmental newsroom Floodlight, and the international project Unbias the News. We hope our work can be a resource for data journalists interested in trying this model, as we prioritize sharing skills instead of competing for scoops.

It’s time to rethink how we report election results https://datajournalism.com/read/longreads/report-election-results Mon, 31 Oct 2022 00:00:00 +0100 Thomas Wilburn https://datajournalism.com/read/longreads/report-election-results As long as I've been in this industry, I've heard election night described as the peak "news nerd holiday." An election is also a huge economic event, a spike of reader attention and intense search traffic that's addictive. That makes it hard for newsrooms to think clearly about what an election means, much less make decisions that might reduce the flow of pageviews.

But in the wake of the Trump era and the January 6, 2021 insurrection, it's time for us to step back and take a clear-eyed look at what our election coverage is actually doing—and how it may have played a role in this democratic crisis.

If you're reading this, you and your peers probably have some influence over how your newsroom handles results. Maybe more than you think! You can make the argument for (and build the implementation of) pro-democracy election results.

At a bare minimum, there are simple, straightforward steps you can take (or more precisely, stop taking):

  • Stop reporting results that are too early to be useful or informative. Require a 50-75% threshold before showing vote counts, so that early swings don't confuse readers.
  • Stop rushing to call races when it undermines confidence and feeds conspiracy narratives. Delay calls until they agree with the current vote tabulation, and consider holding them altogether until most results are complete.
  • Stop creating maps and visuals that reinforce biases about geography and voting. Land doesn't vote—we shouldn't use it as the basis of graphics about people, or about the electoral college.

If you want to go beyond harm reduction and help transform election coverage into something that actually improves democracy instead of just observing it, read on. We'll take a look back at what went right and wrong in 2020, see how the culture of data journalism has contributed to the problem, and finally start a conversation around a fundamental shift in political data reporting.

Getting ready for an election

I've built digital election results for more than a decade, in three very different modes. At CQ-Roll Call, I published balance of power breakdowns for a small, select DC audience in 2010. For The Seattle Times, from 2014 to 2017, I handled the data processing and display for local, state, and national election results. I joined NPR the day before the 2018 election, and led the project team that completely rebuilt our results rig in 2020.

One of the most jarring contrasts around 2020 results was that we did a lot, as an industry, to warn voters about the potential irregularities and issues. At NPR, we had a reporter essentially on "voting" as a beat for much of the year, specifically covering voting by mail, its security, and historical trends.

From the start, it was clear that misinformation about the 2020 election was going to come from the very top. Professional political journalism did not handle Trump particularly well, but there were plenty of stories where Trump refused to commit to accepting the results of the election if it didn't go his way. Instead, he would make wild, evidence-free statements to undermine confidence in voting, laying groundwork for his post-election strategy.

This was, of course, in addition to the standard misinformation about elections. Right-wing politicians and commentators have regularly made false accusations of fraud, such as voters being bussed across state lines or illegal votes by immigrants. Apart from the rhetoric, there were organized efforts to reduce voting numbers, by interfering in the postal service, rejecting ballots, and creating long wait times for non-white areas.

Given this hostile environment, the NPR News Apps team looked at the reporting our newsroom was doing and held a "threat modeling" session. We brainstormed all the potential ways that things could go wrong, including legal challenges in multiple states, unclear results, and potential violence, and then we used that to influence the design of our results in partnership with the Washington desk. We also built a number of safeguards into our data pipeline, so that we could override or flag individual races as needed.

When it came time to build our results, we placed a note at the top of every page detailing potential issues, noting that delays were entirely normal, and linking to our pre-election coverage. We set a 50% reporting threshold before adding indicators for "leading" in races. And we built new visualizations for the national results that would emphasize votes and margin over geography.
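The gating logic itself is simple. A minimal sketch (not NPR's production code) of withholding a "leading" indicator until enough of the expected vote is in:

```python
def race_label(leader: str, votes_in: int, votes_expected: int,
               threshold: float = 0.5) -> str:
    """Hold back a 'leading' indicator until enough of the vote is counted."""
    if votes_expected == 0 or votes_in / votes_expected < threshold:
        return "Counting under way"
    return f"{leader} leading"

print(race_label("Candidate A", 1_200, 10_000))  # Counting under way
print(race_label("Candidate A", 6_400, 10_000))  # Candidate A leading
```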

In discussions with political editors, I often got the feeling that they thought we were being paranoid. As a result, we often compromised on the degree of appropriate caution. After January 6th, I think we weren't paranoid enough.

Where we went wrong

The events of the last few years have highlighted a lot of flaws in journalism that were always there, but can no longer be glossed over as "tradition" or standard practice (e.g., a lot of crime reporters suddenly discovering that the police lie to them, and have for years). Similarly, nothing we did wrong in 2020 is particularly new. Election result displays have been problematic for as long as I can remember, but this year threw those flaws into sharp contrast.

  • Counts and tabulation

Let's start with the most fundamental level of results: tabulation. Our conceptual language around tabulation is often broken. We speak using terms like "leaning," "trending," or "shifting," with the implication that the numbers are fluid or reacting over time. But of course, when the polls close, all the votes are set. Their totals do not change. We just haven't completed the process of addition.

Every election I've ever worked, we've warned readers that the early results are deceptive: they come from rural precincts, tend to lean GOP politically, and do not include the majority of the population of the state (which live in urban areas). We treat this as though it's an education problem, when it's actually an editorial problem. If the early results are untrustworthy, why report them at all?

It doesn't make any sense to talk about what the early vote counts look like when you've just started the adding process. It would be like making a statement about what a puzzle looks like when you've only assembled the edges. It's certainly not clear what the value of announcing percentages for 10% of tabulated votes is, other than it helps fill time on air, and it contributes to pointless drama in political coverage.

Those conceptual biases towards narrative are not just textual, they're also visual.

Take this set of graphs from the New York Times. These charts are not about vote count (which, remember, only goes up), they're about vote share over time, which is meaningless. We don't decide elections based on who was ahead at 25% or 50% of ballots counted. The thing that matters is the total.

Tabulation is also not a statistically random sampling process--there's an order bias to it. You can't look at tabulation and say "there's a trend here," but that's exactly what these graphs try to do. Remember: Biden won Pennsylvania and Wisconsin!

Ironically, the NYT rolled out a new iteration of its much-despised needle for predictions in 2020, and justified it by saying that "incomplete election results are often deeply unrepresentative." Not unrepresentative enough to avoid plotting over time to the tenth of a percent, apparently.

  • Early calls

We spent weeks, as a news industry, telling people that there were going to be delays, that those delays were normal at the best of times and could be exaggerated by absentee voting during a pandemic. All of this caution was thrown to the wind on election night.

By 9 p.m. Eastern time, 23 states had been called by the AP, totalling 195 electoral votes. A third of the country was essentially crossed off only two hours after the first polls closed. Seven more states (and DC) were called by the end of the night.

These calls often feel baffling even if you're aware of how the process works. The AP called Virginia early into the night, 36 minutes after polls closed. At the time, the reported totals had Trump leading the vote count. As a result, our voting graphics at NPR presented the bizarre sight of a solid-blue VA drifting out far to the right of all the other called races, and Alabama performing the same magic trick on the left.

Not only do election desks call races before a majority of votes are totalled, they often call them before any votes are counted publicly. It's a regular headache for newsrooms using AP election feeds: The data will mark a winner as called with 0 votes cast, which is hard to explain to readers. At The Seattle Times, we intentionally ignored AP calls for this reason, and only marked them manually.

There's no real ethical or legal requirement that forces news organizations to call states so early. In fact, something we often forget is that all these results are unofficial: Most states do not certify results until late November, and certification was not complete for all states until December 11 in 2020.

Election calls are not, generally speaking, proven wrong. I have the utmost respect for the statistical wizards who make them. But in an environment where a candidate has spent months insisting that the contest is rigged, it's difficult to conceive why calls can't be delayed until they're in sync with the tabulated results—which usually only requires an additional hour or two. To do otherwise is at best disorienting, and at worst contributes to an atmosphere of conspiracy and distortion around American democracy.
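A minimal sketch of that delay, under the assumption that you control the display layer and can hold a call until it matches the public count:

```python
def display_call(called_winner: str | None, votes: dict[str, int]) -> str | None:
    """Surface a race call only once the tabulated leader agrees with it."""
    if not called_winner or not votes:
        return None
    tabulated_leader = max(votes, key=votes.get)
    return called_winner if called_winner == tabulated_leader else None

# A Virginia-style scenario: the call comes in while the other candidate
# still leads the partial count, so the display holds it back.
print(display_call("B", {"A": 510_000, "B": 480_000}))  # None -- hold the call
print(display_call("B", {"A": 510_000, "B": 620_000}))  # "B"
```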

  • Visual accuracy and perspective

This is our preferred map at NPR for electoral results. It's a straightforward cartogram, attempting to place the states roughly in geographic space so that people don't have to search (a problem with hex maps). It was really gratifying how many newsrooms ran cartograms for results last year.

Unfortunately, this is the map we had as the front of NPR's election pages for the first time in a decade.

This map is a lie. It represents electoral votes using geographic space, dramatically overstating the importance of places like Montana or North Dakota, which are mostly empty space, compared with states like Virginia or New Jersey. This is not only inaccurate, it's politically-biased, playing into political distortions about "real America" and "real Americans." We misled our audience when we put this map at the top of our results.

Even worse are county displays (this map is from Fox News, but other organizations used similar graphics). Counties don't mean anything as an electoral concept, at least for national office. Even in states with split electoral vote assignment, it's done by district, not by county. Most of those red counties are dramatically underpopulated, compared to the blue urban areas. There's no rational purpose in showing a map of the US with counties marked, unless you want to mislead people about "how much" of the country votes a certain way. Not coincidentally, Trump was known to have handed out county-level results in 2016 as part of his argument that he actually won most of the popular vote (he did not).

County and state maps don't represent people, they represent dirt. Dirt doesn't get a ballot.

The failure of results is tied to systemic issues in data journalism

Elections are a natural data-oriented story, which means a lot of us have ended up working on them as an annual (or biennial) ritual. Especially if you're in a newsroom where you don't feel particularly respected, being a crucial part of one of the year's big stories (and the nightly pizza) means a lot.

But that same eagerness means that we have not, historically, been particularly good about pushing back on poor choices, or even thinking about the choices that we make. If you're used to getting crumbs, you don't complain much about the provenance of the cake.

It's also sometimes hard to push back because a lot of the bad ideas are really interesting technical challenges. Building county-level results that perform well is hard. Building a page with a bunch of live countdown counters is hard. Building predictive models that can power a shaking needle is hard! I legitimately hate election coverage, and yet the code that I wrote for 2020 is some of my most elegant, well-designed work.

You can put off thinking about the ethics of an action for a long time if you get caught up in the mechanics of it.

  • Some data stories are easier to assemble than others

It's easy to forget how monumentally difficult the process of getting results is. Anyone who's worked in local news, or contributed to the Open Elections project, can tell you that every state has at least one horrible format for elections. An enormous amount of energy and effort goes into an AP results feed, and I genuinely respect that. But it means that we tend to build around the data that's easy to get.

By contrast, there are datasets that you can't get through AP, DecisionDesk, or Edison:

  • Average wait time to vote
  • Average voters assigned per polling place
  • Votes rejected or deferred
  • Voters challenged, and by whom

These datasets are not about the contest, they're about the accountability of the process. Which, I would argue, is just as important.

This other data is hard to gather. Again, all voting data is hard to gather. Lots of data is. But look at things like the Washington Post's police shootings database—just because it's hard doesn't mean we shouldn't do it. And of course, if we do it regularly, it gets easier.

  • What do election results actually measure?

We tend to present election results as divorced from outcomes. But elections have consequences. People get deported. They die in increasing waves of climate disaster. They're denied gender-affirming medical care. They get sick during a pandemic. They get arrested and shot by racist police departments.

You can't see that in our results, because they treat the election as a spectator sport we're just bloodlessly observing.

I've become convinced election results are the ultimate View from Nowhere, and I'm increasingly uncomfortable working on them.

So what do we do about it?

Ideally, I would like us to stop displaying real-time results altogether, and we should stop running predictions as a part of our coverage. These things are bad for our relationship with our audience, they're bad for journalistic standards, and they're bad for democracy.

But that's not going to happen in most places (unless you are a politics editor or a reporter with a lot of pull, in which case let's talk). The addiction to the traffic numbers and the newsroom habits are just too strong. So how do we convince our organizations to be less harmful?

At SRCCON 2021, I asked participants to think about four challenges for better results:

1. Visual results - Sketch a visualization for results that presents them responsibly.

2. Data sources - Think about datasets that you have, or would like to have, that could be merged or juxtaposed with election results.

3. Pre-election coverage - Imagine story ideas or topic areas that would benefit Americans in the run-up to an election. Keep in mind who reads your coverage—and who doesn't.

4. Reporting on power - Elections are not just about who wins or loses, but who will face material consequences as a result. How can our reporting reflect that?

In response, participants spoke about ways to bring new context to our elections. For example, reporting on misinformation should stress historical context, noting that the myths about voter fraud have persisted without evidence for decades. Likewise, coverage of state and local election laws that influence elections needs to be a priority in newsrooms—not just before the election, but afterward. As new, restrictive voting laws are passed, retrospectives about their impact on elections will be a crucial part of holding the process accountable and enabling democracy to recover.

While it may be tough to convince newsrooms to change their approach overnight, we can start to move the Overton window within our field toward equity. In order to start this shift, newsrooms can take some concrete steps right now:

First, consider the election as a process, not an event. Plan for sustainable coverage before and after election night. Align your engagement goals toward creating loyal audiences from your coverage, not just capitalizing on a burst of traffic and attempts to "win the night." On data teams, diversify your data sources to bring in rich additional context for readers around the whole voting process, not just a count of current votes.

Next, slow down and reduce drama for drama's sake. Hold results displays until a large portion (say, 50% or 75%) of votes are counted, so that you're not just showing people the wild swings from the early returns. Reconsider maps and visuals that play to narratives of a polarized country, and instead highlight issues like the time delay of full tabulation, population density and voting patterns, or who doesn't vote (and why).
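
For data teams, one concrete version of "slowing down" is a simple gate in the results pipeline. Here is a minimal sketch in Python; the threshold and the vote-count fields are placeholder assumptions, not a standard from AP or any other results vendor:

```python
# Hypothetical publish gate: hold the results display until a large
# enough share of the expected vote has been counted.

def should_publish(votes_counted: int, votes_expected: int,
                   threshold: float = 0.75) -> bool:
    """Return True once `threshold` of the expected vote is in."""
    if votes_expected <= 0:
        return False
    return votes_counted / votes_expected >= threshold

# e.g. hold the county map until 75% of the expected vote is counted
if should_publish(votes_counted=412_000, votes_expected=520_000):
    print("render results")
else:
    print("show a 'too early to call' placeholder")
```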

Finally, make a place in your reporting for service journalism. Think about how to answer reader questions in a way that strengthens democracy. Where do I go to vote? What are my voting rights? How do I check if I've been dropped from the voter rolls? How do I guard myself against laws and activism meant to disenfranchise me?

This article was written by Thomas Wilburn and was originally published on OpenNews Source. It is republished here with permission.

]]>
Putting data analysis into your radio programme https://datajournalism.com/read/longreads/putting-data-analysis-into-your-radio-programme Tue, 18 Oct 2022 13:00:00 +0200 Robert Benincasa https://datajournalism.com/read/longreads/putting-data-analysis-into-your-radio-programme Usually, journalists conduct data analysis to provide evidence to their audiences. Whether it’s a connection between lobbying and policy or an inequity in health or education, data will help tell the story.

On the radio, presenting data analysis so that it comes across as evidence to listeners can be challenging. Those of us who write for the radio know that listeners can be less than fully attentive. They’re driving, they’re washing the dishes, or they’re listening part of the time to the story and part of the time to a conversation with someone. That’s why, in an audio piece, good storytelling is often optimised with short sentences and simple ideas, served up briefly and one at a time. And often, we tell stories through scenes, environmental sound and characters.

Data analysis, on the other hand, is dry and quantitative. It might take a few minutes to explain the results, or even just the methodology of your data analysis. A few minutes might be all you have for an entire piece.

With so much complexity at hand, how do you do that?

Let’s start with the idea of measurement, which is what data analysis is really all about. When we measure something, we believe we’re giving our audience something of value: We’re telling them how much or how many, and in comparison to what or who. That said, measurement is never a goal in itself. It’s complementary to on-the-ground reporting.

One approach to reporting data is embedding one’s measurements within the narrative of the story, by associating them with a character. Ideally, this maintains the listener’s focus on the human element, which is likely why they care about the story.

With data analysis and traditional reporting happening at the same time and driving each other, it adds up to this: traditional reporting and data analysis working together to tell a story.

Here’s an example of how we get there:

This story about U.S. coal mines that were delinquent on violation fines was driven by an extremely complex analysis. It involved several databases, and a multi-stage methodology for comparing injuries in coal mines with delinquent fines from violations found by federal regulators.

The complexity came mainly from making sure that safety records were analysed appropriately for the periods when fines were delinquent. We also measured the severity of violations and injuries to make sure we were being fair to mine operators.
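
To make the shape of that analysis concrete, here is a minimal sketch of the core comparison: injury rates for mines grouped by whether their fines are delinquent. The column names, the toy numbers and the rate basis (injuries per 2,000 hours worked, a rough full-time-equivalent year) are illustrative assumptions, not NPR's actual methodology:

```python
import pandas as pd

# Toy data: four mines, flagged by whether their safety fines are delinquent
mines = pd.DataFrame({
    "mine_id": [1, 2, 3, 4],
    "delinquent": [True, True, False, False],
    "injuries": [12, 7, 4, 5],
    "hours_worked": [180_000, 90_000, 150_000, 200_000],
})

# Pool injuries and hours within each group, then compute a rate
# per 2,000 hours worked (roughly one full-time-equivalent year)
grouped = mines.groupby("delinquent")[["injuries", "hours_worked"]].sum()
grouped["injury_rate"] = grouped["injuries"] / grouped["hours_worked"] * 2_000
print(grouped["injury_rate"])  # compare delinquent vs. non-delinquent rates
```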

This snippet of the radio script shows how we integrated our findings into the story narrative. The data findings are embedded in a character and scene -- an interview with Mary Middleton, the widow of a coal miner who died in an unsafe coal mine:

  • HOWARD BERKES (NPR): The co-owner of the mine also controlled eight other mines. Federal mine safety records show even after he and his partners failed to pay the fines for the deaths at Kentucky Darby the other mines were cited for 1,300 more violations, according to Labor Department data. New fines totaled $2.4 million, which also went unpaid.

  • M. MIDDLETON: Where's the breaking point, you know? I mean, I know the Bible says vengeance is God's. He will repay. But you think why are they not being punished?

  • BERKES: They have plenty of company according to multiple sets of federal data and records analyzed by NPR. We obtained the Mine Safety and Health Administration's detailed accounting of mines with overdue safety fines. We then compared those records to 20 years of Labor Department data showing mine injuries and violations. And here's what we found - 2,700 mine owners owe nearly $70 million in delinquent safety fines. Most are years overdue, some go back decades. And get this - those mines with delinquent fines, they're more dangerous than the mines that do pay with an average injury rate 50 percent higher.

Middleton remembers how her family was affected while playing an old Christmas home video with a singing Elvis soundtrack.

Middleton is a compelling character and she provides a real-world anchor for the data. The alternative is to present an abstract number that asks listeners to do math in their heads.

The data findings are also presented incrementally, in small bites. Sometimes words are used -- “years overdue,” “go back decades” -- instead of numbers.

And when numbers are used, they are round figures, with no decimal places.

As data analysts, our obsession with precision is admirable, but we must remember that any number we derive is an estimate, and will be imprecise by definition. When we ignore that fact and report a number carried out to two decimal places, we might be engaging in false precision.

For example, if we find that something is 64.2 percent of something else, writing “nearly two thirds,” is usually preferable on the radio. Most of us have a readily accessible idea of what two thirds means and can think about it without losing focus on the story. Also, “two thirds” is probably an appropriate level of precision.
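
If you want to automate that habit, a tiny helper can translate a precise percentage into the nearest listener-friendly fraction. This is a sketch only; the candidate fractions and the hedging words are editorial choices, not a standard formula:

```python
from fractions import Fraction

FRIENDLY = {
    Fraction(1, 4): "a quarter", Fraction(1, 3): "a third",
    Fraction(1, 2): "half",      Fraction(2, 3): "two thirds",
    Fraction(3, 4): "three quarters",
}

def friendly_share(pct: float) -> str:
    """Map a percentage to the nearest simple fraction, with a hedge word."""
    share = pct / 100
    nearest = min(FRIENDLY, key=lambda f: abs(float(f) - share))
    hedge = "nearly" if share < float(nearest) else "just over"
    return f"{hedge} {FRIENDLY[nearest]}"

print(friendly_share(64.2))  # -> "nearly two thirds"
```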

Similarly, in a graphic, the use of decimal points on graphs, charts and maps is almost never justified by the precision -- or, in other words, the error -- of the measurement tool.

On the web, there are usually more opportunities to write more complex sentences and amplify data findings graphically. And thus web and audio versions of the same story might actually be pretty different.

In the web story about delinquent coal mines, our findings were presented in a more detailed fashion, in part through a bulleted list of findings.

Teamwork makes the dream work

When you’re deciding how to present data in your story, especially on the radio, you’re generally not working alone.

More often, the story is the work of a team that might include data journalists, non-data editors, developers, visual artists and producers who focus on sound. All of them typically have some of the skills the others specialise in, so that’s a lot of potentially competing perspectives.

With all that in mind, let’s look at dialogues that took place on two different teams I worked on, developing radio and web projects at my news organisation, NPR (National Public Radio). One project was about a connection that, at the outset, seemed to ring true.

We know generally that climate change has been making the United States hotter. We also know that more workers who labour outdoors in the heat are dying on the job. There must be, we thought, a connection -- maybe even a causal one -- between those facts. We agreed to look for a link, supported by data, between hotter temperatures and rising deaths. Of course, that wasn’t the only goal of our analysis; we wanted to understand as much as possible about heat-related deaths and the climate conditions present when they occurred.

The project was a collaboration between NPR, local member stations and Columbia University; there were several reporters, data journalists and editors involved, as well as a climate scientist.

It quickly became clear that we weren’t going to connect climate change to death rates. It can take many decades of noisy weather data to detect climate change - and even if we had the data, it wouldn’t make much sense to compare deaths in today’s workplaces to those from decades ago. Still, we pursued government databases on heat-related occupational deaths, and found the highest number were in California and Texas, and in the construction and agriculture sectors.

On a Zoom call early in the reporting, some of the dozen or so team members remarked that the problem of heat-related death was “worse,” in those states and in those industries. As we each contributed to the discussion, the data people were unanimous that the data did not indicate that.

-“I’m not comfortable comparing states,” one data journalist said.

-“The comparisons don’t work,” another chimed in.

They complained that the lack of denominators -- such as aggregate hours worked in each industry in each state -- did not allow meaningful comparisons of death rates across the categories.

-“OK, so how can we compare states, then?” an editor asked.

This is, in my experience, a typical problem.

“California has the most”

I’ve been in many conversations where someone utters that sentence. California has the most cancer deaths, the most tech workers, or the most cars. Then, someone will counter, “California has the most of everything, because it has the most people.”

You might think this is pretty obvious stuff. But one reason this dialogue takes place at all is that to many non-data editors, raw numbers simply indicate the volume of something, and if that something is newsworthy, so is its volume. It’s factual, unadulterated information, and it’s easy for the audience to grasp quickly. To most data journalists, though, raw numbers lack the context of denominators, study design issues, measurement errors and other factors.

To data journalists, raw numbers may be just proxy measurements for differences in the size of the population studied. So, what one journalist sees as simple, another sees as simplistic, or worse, misleading.

There’s merit to both perspectives, depending on the subject. A raw number of deaths, for example, is always significant in human terms. A life is a life, and no qualification is necessary to understand its value. That said, a death rate would be required to support a claim that some problem had increased mortality. If I had to generalise, I’d say data journalists are more caught up in the weaknesses of data analysis. Editors, on the other hand, tend to focus more on the strengths of the data, and how they can add evidence to the story.
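
A toy example makes the point. Rank invented state numbers by raw count and then by a rate per 100,000 workers, and the ordering flips (all figures below are made up for illustration):

```python
# Invented numbers: raw deaths and workforce size per state
deaths = {"California": 90, "Texas": 80, "Nevada": 12}
workers = {"California": 18_000_000, "Texas": 13_000_000, "Nevada": 1_400_000}

by_count = sorted(deaths, key=deaths.get, reverse=True)
by_rate = sorted(deaths, key=lambda s: deaths[s] / workers[s] * 100_000,
                 reverse=True)

print(by_count)  # ['California', 'Texas', 'Nevada'] -- biggest state "has the most"
print(by_rate)   # ['Nevada', 'Texas', 'California'] -- the rate tells another story
```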

Moving forward, the heat project group took the objections of the data people to heart and we looked for denominators showing hours worked by industry and state. What we found wasn’t robust enough for the analysis and we agreed that we wouldn’t be making those comparisons in the story.

Suggestions for your radio reporting:

  • Include measurements in the story's narrative.

  • Make use of a compelling character to serve as a real-world anchor for the data.

  • Don’t embrace precision for its own sake, or fall victim to false precision. Those decimal places in a number probably aren’t necessary and only slow down your readers and listeners.

  • Remember that raw numbers almost always need context. Typically, that means a denominator, a percentage or a proportion.

Beat the heat: Hot days and hotter days

Next, we decided to try to understand the temperature and humidity of the environment where worker deaths took place over a ten-year period. Columbia University climate impact scientist Cascade Tuholske told the team about a database from Oregon State University called PRISM.

PRISM divides the country into cells measuring 4 km on a side and contains weather data for each cell. So, we placed the location of each worker's death inside the appropriate cell. After doing that, we decided to take 40 years’ worth of temperature data for each cell and determine whether it was unusually hot on the day the worker died.

I suggested we generate a percentile rank for the high temperature on the day of death, within the distribution of high temperatures over the four decades. Most of the time, the deaths happened when it was unusually hot: most occurred when the temperature was in the top quintile for that date. There were other findings as well, including the fact that 90 degrees Fahrenheit was a tipping point in the data. Most deaths happened when the high temperature for the day was over 90.
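
Here is a minimal sketch of that percentile-rank idea, with invented numbers standing in for the PRISM extract (the real analysis used roughly 40 years of highs for the grid cell containing each death):

```python
from bisect import bisect_left

def percentile_rank(value: float, history: list) -> float:
    """Share of historical values strictly below `value`, on a 0-100 scale."""
    ordered = sorted(history)
    return 100 * bisect_left(ordered, value) / len(ordered)

# Toy stand-in: 40 years of July 15 highs (deg F) for one 4 km grid cell,
# checked against a 96 F high on the day of a death
history = [84, 86, 88, 85, 90, 91, 87, 89, 92, 95] * 4
print(percentile_rank(96.0, history))        # -> 100.0: hotter than every year on record
print(percentile_rank(96.0, history) >= 80)  # True: in the top quintile for that date
```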

Climate scientist Tuholske thought that wet bulb temperature readings for the day would play a big role in when deaths happened, because he was aware of other research where it was important. But our analysis didn’t find that.

Some of our findings were suited to the radio script, and others to web visuals. Some things, such as the comparisons we made about states in our discussions, would have been misleading had we included them.

Ultimately, the 90-degree inflection point in the data was used in the radio script as part of the story’s lead anecdote – a migrant worker who died in a corn field. This was written not as a sterile number, but in personal terms, as part of the experience of the worker who died.

-“It was hot. At least 90 degrees. He had one bottle of water and no shade,” we said in the script.

The distribution of temperatures on the death dates, and its comparison to the distribution of high temperatures over time, was considered a bit too complex for the radio script. It became a web graphic.

And while we didn’t report comparisons of death rates by state or industry, we did decide to simply state that Hispanics made up a third of the heat fatalities we examined despite being only 17 percent of the U.S. workforce.

Disaster aid helps wealthier homeowners

In 2019, I worked on a project that analysed 40,000 homes that were bought out by the government after flood disasters.

In the course of the analysis, I found a place where buyouts were concentrated, a section of a New Jersey community called Lost Valley.

On the radio, I focused on the local effects of the buyouts: changing racial demographics and a funding shortfall for schools because fewer homeowners were paying taxes.

Lost Valley’s story was used to put our national data analysis into context.

On the web, my colleague and I expanded the data storytelling with a visualisation about greater frequencies of extreme climate events and an interactive graphic about the inequities that disaster victims experience.

We also put the database itself on the web, in searchable form.

Here is a static screenshot of the interactive. Click here for the interactive itself.

Expected deaths in U.S. federal prisons

This example is an analysis done mainly by my colleague at NPR, Huo Jingnan. She was tasked with determining if more incarcerated persons in the U.S. federal prison system were dying during the COVID-19 pandemic.

One thing that was clear early on was that while we might be able to determine how prison death rates had changed, we wouldn’t be able to tie it directly to COVID-19 mortality: the data we had was annual deaths by age with no consistent details on cause of death.

First, Jingnan looked at the death rates of prisoners by their age groups for five years before the pandemic. Then she looked at the prison population in 2020 and calculated how many people would die if the death rates stayed the same. She found that, if the rates had stayed the same, about 300 people would have died in 2020. But in reality, 462 people died.
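
The arithmetic behind that comparison is simple once the age-specific rates are in hand. A simplified sketch, with invented rates and populations (only the actual death count of 462 comes from the story):

```python
# Invented pre-pandemic death rates (deaths per person per year) and
# a made-up 2020 population, by age group
pre_pandemic_rate = {"18-39": 0.0005, "40-59": 0.002, "60+": 0.012}
population_2020 = {"18-39": 80_000, "40-59": 60_000, "60+": 12_000}

# Expected deaths: apply each group's historical rate to its 2020 population
expected = sum(pre_pandemic_rate[g] * population_2020[g]
               for g in population_2020)

actual = 462  # the figure reported in the story
print(round(expected), "expected vs.", actual, "actual")
```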

I read her data results and sent her a Slack message about her wording. Here’s our exchange:

On the phone, she told me that she used a group of several age-specific death rates to calculate the figure for “expected deaths” in 2020. So, we explained that in the radio script:

On the other hand, the web story casts things differently, mainly because one of the editors did not like the explanation of expected and actual deaths used in the radio story. Instead, the editor wanted to use the term “age-adjusted death rate,” because she believed audiences would be more familiar with that - even if we didn’t explain it.

Jingnan had not calculated that kind of overall rate, but after some research into methods, did so by weighting each age group’s death rate by the share of the population that group represents, then summing the weighted rates. That rate also was about 50 percent higher. In the end, the web story said:

  • The federal prison system has seen a significant rise in deaths during the pandemic years. In 2020, the death rate in prisons run by the BOP was 50% higher than the five years before the pandemic. Last year, it was 20% higher, according to the NPR analysis of age-adjusted death rates.

Thus: on the radio, the methodology drove the wording of our section on death rates, while on the web, the preferred wording drove the methodology.
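
For readers who want the mechanics, here is a sketch of the age adjustment described above: weight each age group's death rate by that group's share of the population, then sum the weighted rates. The numbers are invented for illustration:

```python
# Invented 2020 death rates (per person) and population shares by age group
rates_2020 = {"18-39": 0.0008, "40-59": 0.003, "60+": 0.018}
pop_share = {"18-39": 0.53, "40-59": 0.39, "60+": 0.08}

# Age-adjusted rate: sum of each group's rate weighted by its population share
age_adjusted = sum(rates_2020[g] * pop_share[g] for g in rates_2020)
print(f"{age_adjusted * 100_000:.0f} deaths per 100,000 (age-adjusted)")
```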

Lessons learned

In these cases, a diversity of opinions and a willingness to tailor content to the platform made the projects better.

Here’s a Venn diagram I used in my presentation about this project at the IRE 22 conference in Denver this year. A video of the panel discussion is linked below.

The slice of the diagram that represents analysis findings and other reporting that didn’t make it into either the radio or web stories might seem like a chunk of wasted time, but it’s not.

To be sure, data analysis aims to generate evidence and editorial content.

But it also teaches you what kinds of findings aren’t strong enough to publish. Once you know that, you’ll write with more authority and nuance. In short: knowing what you’ve left out, and why, informs your decisions for what to leave in.

When working with data visualisation artists, remember that they might see the story (and the world) differently. And, they may have strong opinions about what works visually. This can be an advantage, because it allows the possibility of another storytelling avenue that brings in audience members who think visually.

Some readers who may not have related to an audio or text story will be much more interested in exploring a data visualisation.

Putting data on the radio is only hard when you present it in isolation. When you use data to add dimensions to a well-defined story character or theme, you’re more likely to communicate your findings effectively.

Recommended resources

  • Book: Sound Reporting: The NPR Guide to Audio Journalism and Production, by Jonathan Kern. This book by the longtime NPR journalist has traditionally been given to every new NPR reporter. It offers invaluable advice on how to craft compelling radio stories.

  • Video: Data on the Radio panel, IRE 22, Denver, Colorado.

  • Howard Berkes’ radio investigations: My former NPR colleague and collaborator’s stories blend personal reporting with the results of complex and ambitious data analyses.

  • This American Life: The Giant Pool of Money. This series uses longform radio journalism to tell the complex story of the U.S. housing crisis of 2008.

  • This web database of defendants in the Jan. 6 Capitol insurrection augments radio reporting with deeper information about the accused. It’s updated continuously to reflect ongoing news coverage and developments in their criminal proceedings.

]]>
Data’s Role in the Disinformation War https://datajournalism.com/read/longreads/data-role-in-the-disinformation-war Wed, 07 Sep 2022 08:00:00 +0200 Sherry Ricchiardi https://datajournalism.com/read/longreads/data-role-in-the-disinformation-war A self-described “college nerd” sat on a porch in Birmingham, Ala., explaining via Zoom how he runs one of the most-followed Twitter feeds on the war in Ukraine. Around 272,000 followers regularly check his account, The Intel Crab.

Justin Peden, 20, is an example of how data is being used to debunk disinformation in today’s high-tech ecosystem. He uses geolocation, satellite imagery, TikTok, Instagram and other sleuthing tools to monitor the deadliest conflict in Europe since World War II.

Scouring the Internet for streaming Webcams, smartphone videos and still photos to pinpoint Russian troop locations, air bombardments and the destruction of once peaceful neighborhoods is a routine part of his day. If a Russian commander denies bombing an area, Peden and other war watchers quickly post evidence exposing the falsehood.

“I never dreamed in a million years that what I was doing could end up being so relevant. I just wanted to expose people to what was going on [in Ukraine]. I really am just a regular college kid,” said the University of Alabama at Birmingham junior. His Twitter profile photo is a crab holding a Ukrainian flag.

Open source intelligence has become a potent force in a conflict the United Nations describes as “a grave humanitarian crisis.” Online detectives like Peden use data to break through the fog of war, operating on computers thousands of miles away. Their impact has not gone unnoticed.

“The intelligence gathering, fact-checking, and debunking is happening in real time. The online crowd is also documenting the movement and placement of Russian troops, creating something more than a snapshot of recent history. It is often actionable intelligence,” said veteran science journalist Miles O’Brien during a PBS (Public Broadcasting Service) program in April.

On the air that day, O’Brien singled out Peden as “a highly regarded practitioner in the fast-growing field of open-source intelligence, or OSINT” and noted that his postings on Ukraine are followed “outside and inside the intelligence community.” The Washington Post included him in a story on the “rise of Twitter spies.”

When the Russians invaded on Feb. 24, Peden combed through images on social media, using metadata embedded in still photos and video to pinpoint the time and place they were taken. He learned a valuable lesson along the way.

At one point, he received an image taken from a balcony in the southern port city of Kherson, showing what appeared to be Russian troops on the move. He verified the image and posted the exact coordinates on Twitter.

Suddenly, he realized the tweet might have placed a Ukrainian in danger of being identified by the enemy. When he deleted the post minutes later, it had already been retweeted 100 times. He no longer geolocates content in Russian-occupied areas.

Truth is first casualty

What is happening now in Ukraine is nothing new. Disinformation has been a factor in conflicts and dictatorships dating back to the Roman Empire. Hitler and Stalin were masters at it. There is the saying, “The first casualty of war is truth.”

Today, however, there is a major shift in the equation.

With the click of a mouse, anybody can transmit false information to the entire planet, no matter how dangerous, malicious or intimidating. The invasion of Ukraine is a textbook example of how digital untruths fueled a humanitarian crisis and fomented hatred that has led to death and massive destruction.

PBS’s O’Brien noted in a broadcast, “We are seeing a war unfold like never before. What once might have been kept secret is out there for all of us to see. The real secret now? Knowing who to trust and what to believe.”

O’Brien’s comment places journalists at the heart of the debate. Technology enables the spread of falsehoods. Open source intelligence helps set the record straight.

It is important to note that disinformation differs from misinformation in that it is not only false but false as part of a “purposeful effort to mislead, deceive, or confuse.” In short, it is content intended to harm.

Journalists strike back

Historically, media have played a crucial role in debunking falsehoods about major events, from conspiracies about Covid vaccines, to climate change, immigration and, most recently, Russia’s invasion of Ukraine. Germany’s Deutsche Welle (DW) is a prime example of how a verification system can expose actors with a malicious intent to inflict damage.

In the run-up to the war, DW’s fact-checking team began compiling a file of false claims and propaganda from both sides in the conflict and publishing corrections. They also made a startling discovery. Fakes were being put out under their name.

In July, they reported “Pro-Russian fabricated posts pretending to be those of the BBC, CNN and DW are fueling the mis- and disinformation war between Russia and Ukraine.” The story cited an example from a Japanese Twitter network. Here is an excerpt:

"It looks like a DW report," a Twitter user comments in Japanese on an alleged DW video about a Ukrainian refugee who is claimed to have raped women in Germany — serious accusations against a man named 'Petro Savchenko'.

The Twitter user writes: `Please share with me the URL of the original video.’ The user seems to doubt the origin of the video — and rightly so. It is not a DW production. It is a fake.”

Among other examples from the DW website: When a Twitter user posted a video purporting to be a live broadcast from Ukraine, a formation of fighter jets could be seen swooping over an urban area. Using reverse image technology, fact checkers revealed it was from a 2020 air show near Moscow. Another video allegedly showing fierce air-to-ground combat between Russia and Ukraine was traced to a 2013 computer game.

DW turned to scholars and practitioners for suggestions on how to make fact-checking more effective. The advice is relevant to journalists anywhere in the world. Among the tips:

  • “Emphasize correct information rather than amplifying claims.” Consider using truth sandwiches: first state what is true, then introduce the false or misleading statement, and repeat the truth, so the falsehood is not the takeaway.
  • “Provide unambiguous assessments (and avoid confusing labels like ‘mostly false’).”
  • “Avoid drawing false equivalencies between opposing viewpoints.”
  • “Situate fact checks within broader issues -- don’t just focus on isolated claims.”
  • “Analyze and explain the strategies behind misinformation -- connect fact checks with media and information literacy.”

This list can also help prevent reporters from being duped into spreading false and misleading information. To the deceivers, any amplification of their message in mainstream media is the ultimate success. It gives their lies oxygen and authenticity.

The “Ghost of Kyiv,” a false story about a heroic Ukrainian fighter pilot, made it into the Times of London, a home run for the fakers. A viral video showing the Ghost shooting down a Russian plane was viewed over 1.6 million times on Twitter. The video was from a video game simulator released in 2008.

Russia’s propaganda model

Gaining a better understanding of how propaganda techniques work to undermine truth is another way to disarm spin masters. A RAND Corporation report on the “Russian Firehose of Falsehood” is a good place to start.

The title refers to a strategy “where a propagandist overwhelms the public by producing a never-ending stream of misinformation and falsehoods.” Even flagrant lies delivered rapidly and continuously, over multiple channels, such as news broadcasts and social media, can be effective in molding public opinion, according to the report.

Published in 2016 at the height of the U.S. presidential election, this analysis provides a road map to how Russia’s disinformation system operates. At the time, Russia was being accused of dirty tricks to influence American voters.

“The report is very much on target for what is going on today. Bucket after bucket of nasty propaganda is being dumped on us,” said social scientist Christopher Paul, the report’s co-author. His research includes counterterrorism, counterinsurgency and cyber warfare.

The report outlines and analyses four main components of the Russian model:

  • High volume and multi-channel
  • Rapid, continuous, and repetitive
  • Lacks commitment to objective reality
  • Lacks commitment to consistency.

The Russians command a powerful arsenal of disinformation tools.

Besides the usual, such as social media and satellite imagery, a vast network of internet trolls attack any views or information that runs counter to Vladimir Putin. They infiltrate online discussion forums, chat rooms and websites along with maintaining thousands of fake accounts on Twitter, Facebook and other platforms.

Their mantra: Repetition works. “Even with preposterous stories and urban legends, those who have heard them multiple times are more likely to believe that they are true,” said the report.

The Rand study offered best practices on how to beat the Russian Firehose of Falsehoods, among them, “Don’t direct your flow of information directly back at the falsehood; instead, point your stream at whatever the firehose is aimed at, and try to push that audience in more productive directions.”

Other tips included:

  • Warnings at the time of initial exposure to misinformation.
  • Repetition of the refutation or retraction.
  • Corrections that provide an alternative story to help fill the gap in understanding when false “facts” are removed.

“It all goes back to journalistic standards. All journalists really need to do to turn the screws is to be as professional as possible. Double-checking, verifying sources, confirming attribution, using data to be accurate and reliable. The burden of truth, the burden of evidence is much higher,” said Paul, a principal investigator for defense and security-related research projects.

Research by a disinformation team at the Stanford Internet Observatory (SIO) supports that notion and provides fodder for data journalists. Led by scholar Shelby Grossman, they identify how to spot disinformation trends in the Russia-Ukraine war and defend against them.

Following is a sample of their findings:

The trend: Old media circulating out of its original context

Grossman saw a video on her TikTok feed of a parachuter recording himself jumping out of a plane. It appeared he was a Russian soldier invading Ukraine. In fact, the video was from 2015.

How to spot: If something seems suspicious or outrageous, use reverse image searching to verify. Upload a screenshot of the photo or video into the search bar of Google Images or TinEye to check where else it might have appeared.

The trend: Hacked accounts

A Belarusian hacking group took over Ukrainian Facebook accounts and posted videos claiming to be of Ukrainian soldiers surrendering.

How to spot: Sometimes the name of the account is changed, but the handle -- the username often denoted by the @ symbol -- isn’t. Advised Grossman: “Just spending 10 seconds looking at an account, in some cases one can realize that something is weird.”

The trend: Pro-Kremlin narratives

Before the invasion, claims began circulating that the West was fueling hysteria about impending attacks in order to boost President Biden politically.

How to spot: Look for reports out of Russian state-affiliated media. SIO reported that both Facebook and Twitter try to label these accounts, including some that are not commonly known to be connected to the Russian state.

Grossman would like to see platforms be more transparent and proactive. “I think that would be useful and important. It gives people information about the political agenda of the content and might give them pause before sharing”, she said in SIO’s March report.

Veteran policy expert Kevin Sheives has another view. He believes civil society, not government and social media companies, is better suited to fight back against disinformation.

“We are looking for solutions in the wrong place. The campaign against disinformation should have civil society at its core,” said Sheives, associate director, International Forum for Democratic Studies, National Endowment for Democracy.

He points out that social media platforms and governments are not designed to prioritize values over business or national interests. That leaves it to journalists, fact-checkers, community groups, and advocates to create counter-disinformation networks.

Countering disinformation networks

“TikTok algorithm directs users to fake news about Ukraine war, study says.” This headline appeared in The Guardian, March 21, 2022.

An investigation, conducted by NewsGuard, a website that monitors online disinformation, discovered that a new TikTok account “can be shown falsehoods about the Ukraine war within minutes of signing up to the app.”

Among NewsGuard’s findings, “At a time when false narratives about the Russia-Ukraine conflict are proliferating online, none of the videos fed to our analysts by TikTok’s algorithm contained any information about the trustworthiness of the source, warnings, fact-checks, or additional information that could empower users with reliable information.”

How NewsGuard did it: Researchers created new accounts on the app and spent 45 minutes scrolling through the For You Page, stopping to view in full any video that looked like it was about the war in Ukraine, according to the report.

Around the same time, the BBC listed several categories of misleading content about the war appearing on TikTok, describing it as “one of the leading platforms for snappy false videos about the war in Ukraine which are reaching millions.”

A TikTok spokesperson noted the company has added more resources to fact-check Russian and Ukrainian content, including local language experts, and beefed up safety and security resources “to detect emerging threats and remove harmful misinformation.”

Since the invasion, many social media platforms and messaging services have taken steps to block state-sponsored or state-affiliated media or add labels to alert users to the source of the information. The jury is out on how well these efforts will work to improve transparency and credibility of information.

The stakes are high. Misinformation and disinformation can have life or death consequences and undermine the democratic way of life. Data journalists are in the thick of this expanding field of digital warfare. Are they up to the challenge as more sophisticated methods of deception sweep the globe?

Resources that can help

]]>
How to preserve data journalism https://datajournalism.com/read/longreads/how-to-save-data-journalism Mon, 18 Jul 2022 00:00:00 +0200 Bahareh Heravi https://datajournalism.com/read/longreads/how-to-save-data-journalism News organisations have longstanding practices for archiving and preserving their content. The emerging practice of data journalism has led to the creation of complex new outputs, including dynamic data visualisations that rely on distributed digital infrastructures.

Traditional news archiving does not yet have systems in place for preserving these outputs, which means that we risk losing this crucial part of reporting and news history.

Taking a systematic approach to studying the literature in this area, and working with digital archiving and preservation experts Kathryn Cassidy, Edie Davis, and Natalie Harrower, I studied the implications that these new types of data journalism content have for archiving and preservation, and looked into potential solutions that we could borrow from more established disciplines, such as data and digital archiving, and software and game preservation.

In the journal paper we published, we identified the challenges and sticking points in relation to the preservation of dynamic interactive visualisations, and provided a set of recommendations for adopting long-term preservation of dynamic data visualisations as part of the news publication workflow, as well as concrete actions that data journalists can take immediately to ensure that these visualisations are not lost. Here I take you through some of the problems we identified in our study and the recommendations for preventing further and permanent loss of content.

Traditional journalistic outputs were usually published in text and audiovisual format, with news organisations having a longstanding history of archiving and preserving these outputs on various media.

This included paper, tape, or hard disc drives, depending on the historical time period and the original format of the output. Similarly, institutions such as national libraries and archives generally hold large and long-standing newspaper archives.

The enthusiastic uptake of data journalism in the past decade, however, has opened up a new set of challenges for preservation and demands new guidelines and practices. The output of data-driven journalism still includes traditional text and audiovisual formats, but it also includes data visualisations and/or news applications.

Many of these visual elements rely on digital infrastructures that are not being systematically preserved and sustained as traditional news archiving has not accounted for these dynamic and interactive narratives.

These visualisations communicate key aspects of the story, and without them, in many cases, the story is either incomplete or entirely missing -- and so is a part of history.

At the same time, an increasing number of such new, complex outputs are being generated in newsrooms across the world every day, and it is expected that this trend will continue to grow. Without intervention, we will lose a crucial part of reporting and news history.

Where is the problem coming from?

Data visualisations are one of the core outputs of data journalism. They could be in the form of static image files (e.g. jpeg, gif, png, etc.), but in many cases they are dynamically generated at the time of viewing, by computer code.

For example, many interactive data visualisations these days are JavaScript-based, such as those made using D3.js libraries, or with online and/or interactive data visualisation tools that are written on top of JavaScript libraries, such as Datawrapper, Flourish, Charticulator, Carto, Mapbox and so on.

These data visualisations are hosted on online web servers and possibly outside of the news organisation. If the code behind the visualisation breaks, the server goes offline, or the link between the publication website and the server hosting the visualisation breaks, then the visualisation disappears or renders an error.

We consider any visualisation beyond a simple image to be a dynamic data visualisation. As such, all interactive data visualisations are considered dynamic. Such dynamic content cannot be captured by existing tools and methods of archiving, such as tools for archiving web pages or images and videos, and consequently are being lost.

Dynamic data visualisations are essentially software, and their preservation therefore should include methods suited for software preservation.

My colleagues in the preservation domain consider these dynamic data visualisations as ‘complex digital objects’.

These are distinguished from ‘simple’ or ‘flat’ objects such as image and video files, as they are more challenging to maintain and preserve for long term and sustained access, because they rely on complex digital infrastructures that contain a series of technical (inter)dependencies, where each part of the infrastructure must function in order to deliver the final output.

Simple objects are more likely to be maintained long term, because they fall under existing preservation methods used within news organisations since the beginning of the 20th century.

In contrast, the many infrastructures that support ongoing access to dynamic visualisations are not being systematically sustained or preserved in a way that would ensure access to data journalism outputs.

In many cases, the organisation that creates the visualisation, and holds an interest in its preservation (the news organisation), is not usually the same organisation that holds the key to that visualisation’s sustainable accessibility.

Evolving technology threatens preservation of new forms of content in different ways. Here, I list the four primary factors that we identified in our research to endanger the preservation of data journalism outputs:

  1. Third-party services: Many data visualisations make use of third-party data visualisation tools, such as Datawrapper and Flourish, which provide useful and often sophisticated assistance in creating visualisations.

However, the use of these tools creates risk because of dependencies on the tool provider: the tool may not be maintained by the provider, changes made to their underlying technologies may ‘break’ the connection to published visualisation on a news site, or the service might disappear altogether.

This has already come to pass with the shutdown of Silk.co and Google Fusion Tables, both data visualisation services once popular with data journalists.

In the case of Silk.co, the website closed on short notice, ending access to any data visualisations that had not been exported or migrated by their creators prior to the shutdown.

A similar scenario happened a year later in December 2018 when Google announced that they would retire their Fusion Tables service.

Fusion Tables was one of the tools behind many early examples of data journalism, such as The Guardian’s coverage of the WikiLeaks Iraq war logs or the 2011 UK riots.

FIGURE 1. Screenshot of the Guardian story, depicting how the content gets lost when the third-party services are not maintained: www.theguardian.com/news/datablog/2010/oct/23/wikileaks-iraq-data-journalism. Screenshot taken on 5th August 2020.

FIGURE 2. Screenshot from October 2021 of the same Guardian story, showing the lost content.

Both stories were early exemplars of data journalism as we know it now, and featured in many talks, tutorials and introductions to data journalism, including Simon Rogers’ TEDx Talk, ‘Data-journalists are the new punks’. I still play the video of his talk in my classes, but none of the maps, the core of these stories, are there.

Google Fusion Tables was switched off at the end of 2019, and much of the associated content disappeared. The Guardian examples mentioned are only two of many stories with missing visualisations across news organisations in recent years.

2. In-house tools:

While many workflows rely on third-party apps, some organisations have also designed in-house tools.

These may afford greater control over the tool and its integration with internal technologies, but often these tools have been designed for specific purposes, such as to communicate the data behind a given data-driven piece.

The longer-term use of the tool or its maintenance may not have been considered during the design process, or no strategy has been put in place to track, archive and preserve the output of such tools.

Additionally, these tools are often developed by a small number of interested news nerds in the organisation (if not just one), who may not stay in the same organisation for long; continued usage or maintenance may vanish completely with the departure of the individual(s) involved.

3. Content Management Systems:

The public-facing website of a news organisation is usually fed by a backend Content Management System (CMS), which itself is regularly maintained, updated, and periodically replaced by new platforms.

Through these changes, the embedding functionality that connects the visualisation to the CMS can be broken or rendered incompatible. In this case, the visualisation and/or the tool remain intact, but the visualisation is not fetched or displayed properly on the news organisation website.

For example, iFrames have been one of the common ways to embed data visualisations created with external online tools into stories. An iFrame essentially creates an opening on an HTML page, which can pull in content from external websites, including visualisations created in a range of external services, such as Datawrapper and Flourish, or the above Google Fusion Tables in the Guardian stories.

Most online data visualisation tools provide iFrame embed codes, which the journalist can simply copy and paste to their organisational CMS.

The smallest change in the iFrame or embed code management in the CMS could break this link. In such a case, the content remains hosted externally, but the content will not be shown on the publisher website.
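
One way for a data desk to catch this failure mode early is to periodically scan published stories for embeds and check that each iFrame source still responds. Here is a hedged sketch using the Python requests and BeautifulSoup libraries; the URL is a placeholder, and a production version would need to handle authentication, rate limits and JavaScript-rendered pages:

```python
import requests
from bs4 import BeautifulSoup

def check_embeds(article_url: str) -> None:
    """Print the HTTP status of every iFrame src found in a story page."""
    html = requests.get(article_url, timeout=10).text
    for iframe in BeautifulSoup(html, "html.parser").find_all("iframe"):
        src = iframe.get("src")
        if not src:
            continue
        try:
            status = requests.head(src, timeout=10,
                                   allow_redirects=True).status_code
        except requests.RequestException:
            status = None
        print(src, "->", status or "unreachable")

check_embeds("https://example.com/my-data-story")  # placeholder URL
```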

4. Myriad of other technologies:

While the above risks point to significant changes in known aspects of the technology chain, there are other dependencies that underpin visualisations, such as particular programming languages, libraries, databases, hosting platforms and tools.

These change over time – by the news organisation, the tool provider, or globally – and changes can cause the data visualisation itself to no longer be accessible or viewable.

An example of technological change can be seen in the consequences of Adobe’s decision to retire Flash. In countless stories published around and before 2010, such as The Guardian’s articles on Earthquakes, or The Financial Times’ Banks’ Earnings, the visualisation itself was the article.

So their disappearance due to the deprecation of Flash resulted in empty pages, with the now-useless suggestions to download or update Flash Player as shown in the images below.

FIGURE 3.1. Screenshot taken in February 2021 from The Guardian story, depicting the disappearance of the full story due to the deprecation of Flash.

FIGURE 3.2. A screenshot taken in February 2021 from The Financial Times.

A 2010 paper by Edward Segel and Jeffrey Heer studied 58 visual stories from several publishing houses in their research on narrative visualisation.

Unrelated to their findings, I note that most of the visualisations they studied are no longer accessible. It happens that at the time of their research, Flash was the go-to technology for creating interactive visualisations.

Just 10 years after this study, Flash Player was deprecated and consequently very few of the visualisations remain accessible. Flash will not be the only casualty, as preferred apps and scripts continue to change over time.

In addition to the large-scale failures, all digital objects, simple or complex, are in danger of degradation or loss over time, due to factors such as data corruption (bit rot), the obsolescence of file formats, software and hardware, and the limited lifespan of storage media.

For all of these reasons, it is imperative that news media prioritise digital preservation.

A screenshot of a message from Adobe explaining that support for Flash Player ended in December 2020.

How to tackle these problems

The findings in our study identified several obstacles, ranging from specific technical challenges to broader social and organisational issues. You can read about the details of it here.

But in short, two broad approaches emerged from the preservation methods:

1) Preservation of visualisations in their original working form

This approach entails keeping a working version of the visualisation available through methods such as emulation, migration, and virtual machines.

An important category that emerged with respect to this approach was the discussion of specific tools for preservation. The tools mentioned for this purpose included ReproZip, which is primarily aimed at reproducible scientific research, and provides functionality for collecting the code, data and server environment used in computational science experiments.

Other well-developed tools exist to capture entire webpages or websites. Examples are WebRecorder and the International Internet Preservation Consortium (IIPC) Toolset, comprising the Web Curator Tool and the well-known, open source Wayback Machine.

In the Data Journalism Handbook 2, Meredith Broussard proposes that ReproZip could be used in conjunction with a web archiving and emulation tool for preserving news apps (Broussard & Boss, 2018).

While the web archiving tools may provide a useful starting point for preserving dynamic data visualisations, they are not always able to capture highly interactive data visualisations, or those that rely on server-side applications and data, such as content embedded via iFrames or other embedding features.

This is because the code is actually sitting somewhere outside of the current webpage. Furthermore, capturing the web through this method (which creates Web ARChive, or WARC, files) is difficult and complex, and not likely to be implemented as part of journalistic workflows.

Additionally, there are a variety of other preservation, workflow management and configuration management tools (Chirigati et al., 2016; Steeves et al., 2017).

The existing approaches to keeping a working version of a visualisation in its original form -- web and software archiving, emulation, migration, and virtual machines -- are not specifically aimed at archiving dynamic data visualisations, have mixed results when capturing interactive content, and are complex and expensive to implement and maintain. Still, they could shed light on the tools necessary for archiving data visualisations.

They could also provide valuable directions for future preservation of dynamic and interactive data visualisations in data journalism.

2) Flattening the visual

The second approach attempts to capture a “flat” or simplified version of the visualisation via methods such as snapshots, documentation, and metadata.

A flat or simplified version, considered in digital preservation language under the category of ‘surrogates’, essentially turns dynamic visualisations from complex digital objects into simple objects, such as images, GIF animations or videos, which are more easily preserved.

The dynamism is not maintained, but an effort is made to capture a sense of the original visualisation to preserve at least some part of it from total loss.

How to choose? Significant Properties

In choosing which of these approaches is most suitable for a given dynamic data visualisation or a given story in news and journalism, in our paper we draw on the concept of ‘Significant Properties’ of digital objects, originally proposed by Margaret Hedstrom and Christopher A. Lee in 2002 in response to the archiving of digital items in relation to their original physical objects -- such as a physical book being archived in digital format (on microfilm!) -- or to the conversion of digital objects from one format to another.

The idea was that the digitised version of a book, for example, may not be capable of preserving all of the properties of the original hard copy materials, such as accurate colour representation or the exact physical dimensions of the originals.

Significant Properties are those properties of digital objects that affect their quality, usability, rendering, and behaviour. These are typically technical or behavioural characteristics of the digital objects, which need to remain unchanged when the file is accessed in the future, in order for the file to fulfil its original purpose.

In the case of image files this might include aspects such as the height, width and colour depth of the image, while for video content it could include aspects such as the playback length and frame rate.

Software and other interactive digital objects tend to have more complicated significant properties relating to their behaviour and the types of possible user interaction.

Computer games, for example, are inherently experiential: the experience of the game is a significant property of the application. This can also be the case with data visualisations. The user interaction and experience may be key to the meaning and value of a given data visualisation.

On the other end of the spectrum, interactivity may not provide vital value to the visualisation; rather, the information conveyed through the different interactive elements may be considered the significant property. Or it could be somewhere in the middle.

The first step, therefore, is to identify these significant properties, weigh them against the time and resources available, and choose our preservation methods accordingly.

Where interactivity is a Significant Property, an approach using techniques such as emulation or migration may be indicated, as this preserves a working version of the original visualisation, and is thus more likely to preserve all of the significant properties of the object.

This approach would be in line with recent recommendations by the Digital Preservation Coalition on preserving Software.

On the other hand, for some interactive data visualisations, dynamism and interactivity are not significant properties of the object, and much of the message is communicated without these aspects.

In such cases, it may not be necessary to preserve the entire interactive data visualisation in a working form, as an approach using snapshots and documentation as surrogates for the original may satisfactorily retain the significant properties.

Identifying whether and to what extent these are significant properties of a visualisation can help in selecting which approach to take in its preservation. But these must be considered alongside other resource and workflow requirements and limitations for preservation.

Non-technical challenges

Regardless of the technical approach taken to preservation, several systemic methods could be drawn from recognised topics in the digital preservation domain. Overall, our research indicates that the complexity of the task is the biggest obstacle to preserving these objects. This complexity is not limited to technical aspects.

Rather, it is in part attributable to the wider cultural or organisational challenge of digital preservation, where resources - financial and human - are limited, preservation is not embedded in publication workflows, and advocates for preservation are few and far between.

Furthermore, the responsibility for these actions must be identified and pursued systematically. Awareness-building around preservation, guidelines for preserving visualisations, and training on how to integrate preservation into workflows can assist with these larger social or organisational challenges.

Recommendations for going forward

Recommendations for Immediate and Practical Interventions by Data Journalists

Here I start with a set of immediate and simple actions that data journalists can take to ensure at least partial preservation of the content they are producing now, until more robust approaches are developed and widely implemented.

These approaches assume limited time and resources. They combine a basic identification of significant properties with the creation of surrogate outputs in formats that are easily preservable using current technologies, such as images and audiovisual files.

This is essentially a basic form of the snapshot method identified in the literature.


We propose that for every dynamic data visualisation included in a story, the journalist should:

  1. Identify the significant properties of the data visualisation, in terms of their importance to the story at hand.

  2. If an image screenshot of the data visualisation could represent these properties to a satisfactory degree, then take a screenshot of the visualisation, and store it with other archived audiovisual content.

Screenshots have been used by some news organisations in their archiving practices. Figures 4.1 and 4.2 depict two examples from The Washington Post and The New York Times, where the original stories are inaccessible due to the deprecation of Flash, but the organisations offer access to alternate archived content.

FIGURE 4.1. Screenshots taken in February 2021 from The Washington Post depicting the disappearance of the stories due to the deprecation of Flash, as well as the accessible screenshots through their archives.

FIGURE 4.2. Screenshots taken in February 2021 from The New York Times, showing a story rendered unreadable by its reliance on Flash.

Following the link in The Washington Post story retrieves a PDF, which had been previously generated for the print version of the story.

Clearly, this conveys an acceptable degree of the original story’s intention. However, the link in The New York Times story retrieves a screenshot that only shows the first slide of a multi-slide story, which means a significant part is missing.
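For step 2, the capture itself can be automated. Below is a minimal sketch using the open-source Playwright browser library, an assumption of tooling on my part rather than a method prescribed in the paper; the story URL and output path are placeholders.

```python
# Minimal sketch: capture a full-page snapshot of a published
# visualisation with Playwright (pip install playwright, then run
# `playwright install chromium`). URL and filename are placeholders.
from playwright.sync_api import sync_playwright

STORY_URL = "https://example.com/my-interactive-story"  # hypothetical

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1280, "height": 800})
    page.goto(STORY_URL, wait_until="networkidle")  # let the viz render
    page.screenshot(path="story-snapshot.png", full_page=True)
    browser.close()
```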

  3. If an image screen grab cannot capture the story to a satisfactory level, then we propose two alternatives:

a. If a small number of screen grabs can tell the story, then create a GIF animation that includes these in sequence, and archive as above. GIF animations allow limited animation but are nonetheless relatively simple image files which are straightforward to preserve.

Many news organisations already create animated GIFs for content promotion on social media and so the tools and expertise are readily available.

The Economist data desk, for example, gave a workshop, “From interactive to social media: how to promote data journalism”, at the 2018 European Data & Computational Journalism Conference, describing how it creates GIF animations to promote its interactive data visualisations on social media.

These GIF animations, in essence, capture some part of the significant properties of the original interactive data visualisation.
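To show how little tooling this requires, here is a minimal sketch that assembles a few screen grabs into a GIF with the Pillow imaging library; the frame filenames and timing are placeholders, not part of our recommendations.

```python
# Minimal sketch: combine sequential screen grabs into an animated GIF
# with Pillow (pip install Pillow). Frame files and timing are
# placeholders for whatever grabs you have taken.
from PIL import Image

frame_files = ["grab-1.png", "grab-2.png", "grab-3.png"]  # hypothetical
frames = [Image.open(f).convert("RGB") for f in frame_files]

frames[0].save(
    "visualisation.gif",
    save_all=True,              # write every frame, not just the first
    append_images=frames[1:],   # remaining frames in sequence
    duration=1500,              # milliseconds per frame
    loop=0,                     # 0 = loop forever
)
```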

b. If a number of screen grabs in the form of a GIF animation cannot do justice to the visualisation, then consider creating a video cast of the data visualisation in use, highlighting the most important parts. A range of widely-available free tools can be used to create such video content which is also relatively simple to preserve.

These simple surrogate representations must also be linked to the original story to ensure that the reader can find them if the story remains available, but the original visualisation is no longer available.

This linking could be via a structural solution whereby the CMS of the news organisation allows an alternate link to be specified and automatically displays the file behind the link if the main visualisation fails to load.

An alternative, or possible interim, solution would be to include a link under each visualisation to the surrogate version, inviting the user to click on it if the visualisation does not display correctly. The figures above show an example of how this has worked in practice.
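One way a CMS could implement such a structural fallback, offered here as a sketch rather than any newsroom’s actual approach, relies on HTML’s native `<object>` element, which renders its child content when the embedded resource fails to load. All paths below are hypothetical.

```python
# Sketch of a CMS template helper emitting an embed that degrades to a
# static snapshot. HTML's <object> element renders its child content
# when the resource in `data` cannot be loaded. All paths hypothetical.
def viz_with_fallback(viz_url: str, snapshot_url: str, caption: str) -> str:
    return f"""
<object data="{viz_url}" type="text/html" width="100%" height="600">
  <!-- Shown only if the interactive version fails to load -->
  <img src="{snapshot_url}" alt="{caption}" width="100%">
  <p><a href="{snapshot_url}">View an archived snapshot of this
  visualisation</a> if it does not display correctly.</p>
</object>
""".strip()

print(viz_with_fallback(
    "https://example.com/viz/flood-map.html",    # hypothetical
    "https://example.com/archive/flood-map.png",  # hypothetical
    "Flood levels by district, 2021",
))
```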

  4. Creating an image, GIF animation or video of your data visualisation is an uncomplicated solution that captures the significant properties of the content as story, providing a stop-gap until more systematic and sophisticated methods for preserving dynamic data visualisations are in place. In addition to supporting long-term preservation and access, this simple method can also mitigate issues with loading complex objects across devices.


We also propose that every provider of data visualisation creation tools should ideally offer GIF animation and video exports, in addition to their current visualisation export options.

Many data visualisation providers promise their users that in the case of company closure, users will be given the option to download the code behind the charts.

This is a responsible offer, but most journalists will not have the time or skills to execute that code on a different platform. Nor will they be able to go back to every single story they created to update the server information for where the data visualisation is hosted.

Hence, it is advisable that journalists create simple exports of their data visualisations at time of publication, and provide the information for how these can be accessed if the original publication fails.

Both data journalists, and the wider digital preservation community, should advocate with vendors of these tools to help bring this about.


These immediate and relatively contained measures could ensure that much of the data journalism currently being produced is not lost entirely, while the newsrooms find ways to implement the recommendations to ensure longer term systematic preservation of such complex objects.

In addition to these, in the paper, my colleagues and I provide a set of recommendations for systematic and more long term interventions. These recommendations draw on the systematic study of the literature in a set of relevant areas such as web archiving, digital preservation, software and game archiving, methods detailed in professional literature from the fields of data journalism and digital preservation, as well as our professional expertise as academics and practitioners in these areas.

Our recommendations for systematic, organisation- and discipline-based interventions fall into several categories: guidance and education; infrastructure and tools; collaboration with trusted local and national digital repositories and memory institutions; funding and resourcing; and legal frameworks.

These recommendations for long-term and systematic interventions address the need for an organised and sustainable approach to the long-term digital preservation of data visualisations.

They aim to ensure that these increasingly important elements of journalistic output are routinely preserved alongside simpler forms of digital news media.

These medium to long-term actions require changes to workflows and investment into new policies, practices and technical solutions. As such, they require an investment of significant effort over time, financial resources, and collaborations that may expand the remit of existing institutions.

If you are interested in reading more, see the Recommendations section of the paper.


I would like to note here that the scope of this article, and the research paper underlying it, includes works relating to the preservation of dynamic data visualisation and associated software code and dynamic digital objects.

The preservation of the datasets that underlie data visualisations is also key; in some cases, they are required to make the visualisation function as it is rendered. In any case, the data should be persistently accessible to verify the findings communicated by the visualisation. However, this is a separate, larger issue for digital preservation and is out of the scope of this article.

As a pointer and food for thought, the preservation of research data is being studied by international initiatives such as the Research Data Alliance and the CODATA committee of the International Science Council, which could provide valuable input into the preservation of data and code when it comes to data journalism.

Digital preservation is an ongoing process, not simply an endpoint. Methods must evolve within and by the communities that are most invested in the long-term stewardship of their outputs.

Because of journalism’s fundamental and unique contribution to the historical record, it is imperative that preservation is built into the production of data journalism, so that this key element of the record is not lost.

Bahareh Heravi is a Data and Computational Journalism researcher, trainer, practitioner and innovator. She is currently a Reader in AI and Media at the Institute for People-Centred AI at the University of Surrey in the UK. Bahareh is a member of the Irish Open Data Governance Board, and a co-founder and co-chair of the European Data & Computational Journalism Conference. She previously was an Assistant Professor at the School of Information and Communication Studies at UCD, where she led the Data Journalism programme.

This article is a shortened and adapted version of an academic paper that Bahareh Heravi co-authored with her colleagues Kathryn Cassidy and Natalie Harrower from the Digital Repository of Ireland, and Edie Davis from the Library of Trinity College Dublin.

For the full journal paper, and for citation and referencing, please visit the journal website.

This is where data journalists get their ideas from https://datajournalism.com/read/longreads/data-journalism-ideas Wed, 18 May 2022 08:00:00 +0200 Paul Bradshaw https://datajournalism.com/read/longreads/data-journalism-ideas Data journalists get their ideas in a range of ways — from questions and tip-offs to news events and data releases.

But if you’re new to the field you can often struggle to come up with inspiration. If you’re looking for data journalism ideas, then here’s a guide to different ways to generate them — and the types of stories they might produce.

1. Ideas come from new data releases

2. A news event is the spark for an idea

3. An example provides a template for an idea

4. A question inspires an idea

5. Tip-offs as a source of ideas

6. Exclusivity-driven ideas

7. Ideas driven by play

1. Ideas come from new data releases

Probably the best way to get started with data journalism is to work from scheduled data releases.

These are datasets typically published by public bodies, such as a national statistics body, ministry or local government, open data portal, or international organisation (the World Bank and UN are two examples).

New data releases solve two challenges with data journalism story ideas:

  • The “what’s new?” question (“new data says X” is the obvious new thing, even if the time period covered was last year); and
  • Getting hold of the data

(Other sources of ideas, outlined below, will require you to work harder on both challenges).

The downside of data releases as a source of ideas is that the release will also be seen by lots of other journalists — so you’ll have less time to turn that data into a story before it becomes ‘old news’.

For that reason data release-driven stories need to be relatively simple: you’re likely to be analysing data to find out the scale of something; how much it has changed; or how different areas or categories rank in terms of a particular issue (e.g. which area or category is worst affected or where does your local area rank).
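All three of those calculations are simple to script. Here is a minimal sketch using the pandas library; the filename and column names are invented for illustration and would need to match the actual release.

```python
# Minimal sketch of the three workhorse calculations for a data
# release story: scale, change, and ranking. File and column names
# are hypothetical. Requires pandas (pip install pandas).
import pandas as pd

df = pd.read_csv("release.csv")  # columns: area, cases_2021, cases_2022

total = df["cases_2022"].sum()  # scale: how big is the issue nationally?
df["change_pct"] = (df["cases_2022"] - df["cases_2021"]) / df["cases_2021"] * 100
df["rank"] = df["cases_2022"].rank(ascending=False).astype(int)  # ranking

print(f"National total: {total}")
print(df.sort_values("rank")[["area", "cases_2022", "change_pct", "rank"]].head(10))
```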

Data releases are typically published to a pre-announced schedule, which can be found on the organisation’s website by searching for “data release calendar” or similar terms, along with the organisation’s name.

There are likely to be previous releases of the same data that you can look at in advance.

This will help you anticipate what information will be contained, what sort of shape it will be in, and what sorts of techniques you’ll need to use with it.

You’ll also get an idea of the language involved — including any terms you need to better understand (e.g. how is “homelessness” defined? What is a “frequent caller”?). This can lead to background reading and research.

You can make sure that you understand how the data was gathered (for example is it based on a sample), and what it can and cannot say as a result.

Finally, you can research the topic itself, too, to understand what might be the most newsworthy dimensions of the data — and which people (experts, politicians, charities, representatives, etc.) you might approach for interviews.

All of this will mean you are better prepared to turn that data into a story more quickly — and do it better — than the competition.

2. A news event is the spark for an idea

A news story like this can trigger a data-driven follow-up to establish the scale or trend of similar events

Many data story ideas are driven by a news event. This prompts the journalist to look at data to put that event into context. Typical examples include:

  • A physical event occurs (e.g. a particular type of crime; an environmental event; an emergency event; a protest)
  • A verbal event where a claim is made (e.g. a politician blames a problem on a particular cause; concerns are raised about something)
  • A political event where an announcement is made (e.g. a new policy, new funding, or new laws to tackle a problem)

Different types of news events can generate different questions. Physical events can prompt questions like “How common is this type of event?” or “Are there more of these events than there used to be?”

Similar questions might be asked in relation to claims and announcements, but might also focus more on the basis for those statements.

For example: “Does the data support the claim/concerns?”; “Does the problem justify a new law/policy/funding?”; “Is that funding enough to make an impact?”; “How effective have similar policies/laws been before/elsewhere?”

A news event-driven data story is an example of “moving the story on” — it’s often done in newsrooms by seeking new reaction to a news event from someone newsworthy (e.g. a politician, industry representative, spokesperson or celebrity), or looking for new action being taken (e.g. by politicians or political bodies, charities or businesses), or new information being revealed.

In this case data is providing the new information that moves the story on.

In 2016, for example, five dead sperm whales were washed up on beaches in the east of England within days of each other. That prompted the BBC England Data Unit to look at data on how many of the sea creatures die every year on UK shores.

This data-driven story was a follow-up to the news story above on a beached whale.

The data story was completed within 24 hours, and that’s important: the window of opportunity for most news-driven ideas is small and the story will need to be completed quickly because any data journalism story is only likely to be newsworthy during the few days following the event — and in some cases, only on the same day.

Most news event-driven data journalism stories, then, are likely to be technically simple and focused on similar calculations to data release-driven stories: finding out the scale of the type of issue that’s been in the news; whether events related to that issue have increased or fallen (or “failed to improve”); or an angle relating to ranking.

Anything more complex means the story will take too long, and it will no longer be newsworthy because that event, claim or announcement will have slipped off the news agenda.

That doesn’t mean the end of your idea. Events can recur, of course, and claims or announcements will often be followed up later by further announcements or claims, so you can prepare data in advance in anticipation of those events.

3. An example provides a template for an idea

If you want to spend more time on a data journalism story, one way to come up with ideas is to look for an example of data journalism that can be applied to a different place, category, or time period.

Freedom of Information (FOI) stories are a good example of this.

If someone wrote an FOI story a few years ago about the number of bike thefts in your area, for example, you could:

  • Update the same idea by repeating the FOI request, this time asking for the most recent few years
  • Apply the story to a different area (you could look at the wider region as a whole, for example, or see how your area ranks nationally)
  • Adapt the idea to a different category of crime, such as car theft

Many other data journalism stories can be adapted in the same way.

You might see a story about the increase in floods, for example, and consider adapting it for earthquakes, or bringing it up to date, focusing it on a particular country, or expanding it to a global angle.

The second story shown here may well have been inspired by adapting the idea from the first, which appeared 3 months earlier.

The challenge, however, is to make that story newsworthy. Unlike data release-driven ideas, where the newness of the data gives you a news hook, or news event-driven ideas, where the event gives the subject topical interest, an example-driven idea may not be inherently newsworthy.

FOI stories have an advantage here: they are exclusive, and their news hook tends to rest on the fact that they “reveal” something. A typical intro might use the phrase “figures from X reveal” or “analysis/an investigation by [your publication] has revealed”.

But if you are using other examples as a template for your own, it’s important to ask why the original story was published.

It may be that there was some news event, or new dataset, that prompted it.

It may be that the issue is much more important in that area or category than in the one you plan to apply it to.

Understanding what made it newsworthy in the first place can help you to identify if it’s newsworthy now — or what might make it so again.

For example, it might lead you to find out that some new data is out soon, or to prepare a story for the next time the particular type of event occurs (if it’s an event that happens regularly enough to be confident that will happen).

One useful strategy when the data isn’t new enough to be newsworthy on its own is to focus the story on some reaction to what you find.

For example you might get an interview with a charity or politician to comment on what you find, so that your story can lead on their ‘call for action’ or ‘concerns raised after data shows X’. (Occasionally it might go even further: action will be taken after you share your findings, which makes an even stronger story.)

If you can’t find something new about the issue you’re about to dig into — rethink your story. It might be that there’s not a strong enough justification to do it.

Put it to one side and look at other ideas instead — you’ll probably come up with a better one.

4. A question inspires an idea

A story like this may well start with the headline phrased as a question.

Some of the best ideas in data journalism come from a simple question: are women’s pocket sizes really as ridiculous as they seem? How widespread is discrimination against people on benefits in the rental sector? How have Europe’s prisons fared in the Covid-19 pandemic?

The resulting story is likely to be the answer to that question as a news story about what you reveal (e.g. “X% of properties won’t accept people receiving welfare”), or an exploratory feature (e.g. “Here’s how prisons were affected by the pandemic”).

Questions have the benefit of originality, and can be quirky too — but they come with the risk that no data exists to answer that question.

The most important stage with a question-inspired idea, then, is to scope what data exists, and how practical the story might be.

Ask yourself:

  • How would information be recorded about this activity?
  • Who would record it?
  • Is that one organisation or multiple ones? (For example different private companies)
  • How likely would they be to share that data?
  • Has anyone conducted surveys about the activity based on a representative sample?
  • If direct information doesn’t exist, what proxy data might exist? (i.e. data which indirectly measures that activity, such as people mentioning it or searching for it online, or data about secondary effects of the activity)
  • Would I be able to collect enough of this data myself if I have to?

If your question is “How much wrapping paper is thrown away every Christmas?”, for example, you are likely to quickly come up against the problem that when a person throws away wrapping paper, no one records it.

The next step might be to identify if an organisation has collected survey data which answers your question.

You might contact charities who have campaigned on the issue of waste (e.g. environmental charities); or search Google Scholar for any academics who have researched the field and might be aware of any data.

Be prepared for any data that you do find this way to be unsuitable for your purposes (at least on its own). Surveys may have been conducted some time ago, or with a very small sample, or in a different country. They might still have some relevance to your story idea (they might be worth a sentence), but you’ll have to admit they’re not strong enough to carry the story on their own.

Another option is proxy data. Proxy data has been widely used to tell different stories about the impact of COVID-19. As I wrote during the first months of the pandemic:

“Air pollution data, for example, can be a proxy for transport activity; energy consumption data can be a proxy for economic activity; waste collection data is a proxy for people moving away or working elsewhere. A spike in people dying at home can raise questions about what that indicates. Social media chatter and search trends are regularly used as proxies for behaviour, too.”

An alternative to proxy data is to collect the data yourself.

In the case of the story about pocket sizes, Jan Diehm and Amber Thomas made measurements themselves of “80 pairs of blue jeans from the most popular and widely available brands in the US … at brick and mortar stores in Nashville, New York, and Seattle.”

For the story on discrimination in the housing market I scraped data from thousands of adverts for rental properties.
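I won’t detail that scraper here, but to give a flavour of the approach, below is a deliberately simplified, hypothetical sketch using the requests and BeautifulSoup libraries; the URL and CSS selectors are invented, and any real scraper should respect a site’s terms of service and robots.txt.

```python
# Hypothetical sketch of scraping rental adverts with requests and
# BeautifulSoup (pip install requests beautifulsoup4). The URL and
# CSS classes are invented; check a site's terms and robots.txt first.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/rentals?page=1"  # hypothetical listings page
soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

rows = []
for advert in soup.select(".listing"):  # hypothetical CSS class
    rows.append({
        "title": advert.select_one(".title").get_text(strip=True),
        "description": advert.select_one(".desc").get_text(strip=True),
    })

# Flag adverts that mention excluding benefit recipients ("No DSS")
flagged = [r for r in rows if "no dss" in r["description"].lower()]
print(f"{len(flagged)} of {len(rows)} adverts exclude benefit recipients")
```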

When ITV News wanted to do a story on the impact of funding cuts in schools they asked the National Association of Head Teachers to distribute a survey to thousands of school heads, and then asked students on the MA in Data Journalism at Birmingham City University to help analyse the results.

In each case it’s important to collect enough data: spend time considering what is going to be a big and representative enough sample of the phenomenon you’re trying to tell a story about (a survey of a few teachers via your social media isn’t going to cut it, for example).

You might also consider focusing your story on the lack of data itself. In 2016, for example, I worked on a story about how many schools were failing to publish transparency data, while one British Medical Journal investigation focused on the fact that "medical schools are failing to monitor racial harassment and abuse of ethnic minority students". Missing Numbers contains dozens of stories about gaps in public data, and the books Invisible Women and Data Feminism provide many examples of stories focusing on how a lack of data has real world impacts.

None of these are small projects, so it’s important to decide if the story justifies the effort involved (and will still be newsworthy by the time you’ve got the data you need), and if you’re passionate enough about the project to see it through to completion.

5. Tip-offs as a source of ideas

Tip-offs are essentially ideas that come from other people: “You should look into X” or “It would be great if someone found out Y” or “My organisation has seen a big increase in Z”.

They are similar to question-driven ideas but have a couple of key qualities which lead me to treat them separately:

  • First, tip-offs tend to suggest that there is something of note to find (whereas questions may not lead to something newsworthy); and
  • Second, the person doing the tipping off may also be able to provide advice in obtaining data or interviews (see exclusivity-driven ideas below).

But the problems facing tip-offs are also the same: it may be difficult, or not possible, to get hold of the data that you need.

You will need to ask the same questions as with question-based ideas: about how and where such data might be recorded (including surveys); who may hold the data; how likely they would be to share it; what proxy data might exist; and whether it’s practical for you to compile data yourself if none exists.

6. Exclusivity-driven ideas

Sometimes ideas will come from a dataset that you have obtained exclusively. This can happen in a number of ways:

  • An organisation approaches you with some data that they’d like to share exclusively
  • An individual approaches you with some data that they’ve obtained that no other journalist has access to (for example, a leak, data that they’ve scraped, or data from unpublished research)
  • You approach an organisation or individual to obtain data for a story

Strictly speaking, in this last type of exclusivity the story idea has already been identified — it’s not generated by the exclusivity of the data. But the exclusivity does give the idea extra value.

But in the other types of exclusivity you may still need to find story ideas in the dataset you’ve been given exclusive access to. The provider may think the data is inherently newsworthy — but its exclusivity alone won’t mean it contains a newsworthy story, or justify the time and effort involved in transforming it into one.

Major leaks such as the Pandora Papers, LuxLeaks and the Wikileaks war logs touched on major conflicts and powerful individuals, justifying massive international collaborations and months of work.

But exclusive data from an academic study, or market research by a corporation, will require careful consideration before you commit resources to a story.

And even apparently juicy leaks from a whistleblower can turn out to reveal nothing new — only information that is already public.

The key questions to ask when being offered exclusive data are: “What does this tell me that is newsworthy?” and “What’s their motivation for giving it to me exclusively?”

It’s worth bearing in mind, for example, that exclusivity is sometimes used to try to manipulate journalists (a reporter may be offered an exclusive interview on condition of “copy approval”, for example).

Exclusivity can also be created artificially in order to appeal more to journalists.

Worst of all, it can cause a reporter to treat the material less critically, as the exclusivity of the story transforms it into something ‘owned’ by the news organisation itself, which it must then defend.

In some cases a sunk cost fallacy can mean effort continues on a story even when it becomes apparent that it has significant flaws.

Suppliers of exclusive data are inevitably going to attach more value to it than the casual reader, either because it’s in their specialist field or, in the case of a large cache of data scraped by a civic hacker, because of the skill involved in obtaining it (again, a potential sunk cost fallacy).

The Telegraph spent a week exploring the expenses data before deciding to invest time in reporting it.

The data journalist’s role is to remain sceptical about that value, and focus instead on what value it might have to the reader, in terms of the stories that it contains.

When the Telegraph newspaper obtained a massive leak of information on politicians’ expense claims, for example, they spent a week inputting that data to identify its news value before deciding to commit to it as a source of stories (the data went on to provide so much material that the newspaper dominated the news agenda for six weeks).

Drawing on the 7 common angles for data stories can help you assess the potential editorial value of the dataset. What stories can this data reveal about the scale of something newsworthy? About change?

Can it reveal where places, categories or organisations rank? Unfair variation between different parts of the country or society? Relationships that raise questions? Individual leads?

The dataset may provide the basis for an exploratory feature: when Art UK approached the BBC England Data Unit with an exclusive dataset on 200,000 of the nation’s oil paintings, we wrote a story on different dimensions of art that the data revealed (most were focused on ranking — but it also included the scale of paintings donated in lieu of tax, and the scale of works by unknown artists).

7. Ideas driven by play

What would a map of literary road trips look like? What if we could test out different conditions for a pandemic outbreak? Could we make a calculator so people know how much food they really need to stockpile?

These are ideas driven by a sense of play — of curiosity, experimentation, and exploration.

Play-driven ideas often draw on the interactive possibilities of data-driven storytelling: the fact that, once information is structured into data, it can be used to create tools, personalised experiences, ergodic storytelling, exploratory maps, simulations and games.

So ideas in this category tend not to revolve around stories as such, but formats where the story is generated by the user themselves through their own play — their user journey.

This is reflected in the way such stories are pitched to users through their headlines — a relatively new approach which you’ll need to get to grips with when explaining your idea.

It might raise a question, rather than answer it. What Happens Next? invites you to explore COVID simulations; Based on a True True Story? invites users to explore visualised timelines of how true films are; and Where My Money Dey? allows you to explore public spending.

It might issue a call to action (sometimes after raising a question), as MIT Technology Review do with Can you make AI fairer than a judge? Play our courtroom algorithm game. Or the BBC do with Check NHS cancer, A&E, ops and mental health targets in your area. The LA Times’s Every shot Kobe Bryant ever took. All 30,699 of them, in contrast, includes an implied call to action (“explore”).

It might simply hint at the journey that you are about to take. Reuters Graphics’s “Breaking the wave”, for example — itself a headline which implies a question — leads on “Measuring the death toll of COVID-19 and how far we are from stopping it.” Similarly, Disease modelers are wary of reopening the country. Here’s how they arrive at their verdict suggests we will be following an exploratory path.

Some simply describe what the format is: Can it pass the Senate? Interactive Australian voting calculator, for example. Or Evacuating Afghanistan: a visual guide to flights in and out of Kabul (note how it still frontloads the headline with a topical hook). Or People of the Pandemic: a hyperlocal cooperative simulation game.

Because of the interactivity involved, play-driven ideas are often (although not always) resource-intensive and technically demanding.

For that reason they tend to suit fields that the author (and audience) are passionate about, such as sport and music — or issues that are long-running (climate change; the pandemic) or regular (and major) enough to justify the investment.

This is why elections, budgets and quadrennial global sporting events tend to feature heavily in this category.

These areas are a good place to start if you want to generate a play-driven idea. How can you make an election or World Cup the basis of an interactive tool? How can you create a new way of looking at, or exploring, a long-running issue? How can you turn your passion for music into a data-driven feature that others can explore?

Pick the idea that suits your skillset and time

Ultimately there are pros and cons to all the sources of ideas outlined here. Some are quick but less impressive; others are complex but potentially eye-catching. The key is to pick a project which is achievable given the time and skills you have at your disposal.

If you’re just starting out with data journalism, data release-driven ideas are one of the best ways to get started: you can plan ahead, work with previous releases, and demonstrate core data journalism skills.

Once you’ve built those core skills you can challenge yourself further in a direction that fits with your professional objectives.

If you’re interested in coding and interactivity, try a play-driven idea; if your focus is more on exclusive newsgathering, try adapting an example of a previous idea using FOI, or cultivating sources who might provide you with exclusive access to a dataset.

And if you’ve already tried a number of the approaches listed here, use the list to identify one you’ve never tried before.

Humanising data: Connecting numbers and people https://datajournalism.com/read/longreads/humanising-data-connecting-numbers-and-people Wed, 29 Dec 2021 11:30:00 +0100 Sherry Ricchiardi https://datajournalism.com/read/longreads/humanising-data-connecting-numbers-and-people When ProPublica and National Public Radio partnered for the series “Lost Mothers,” they discovered an alarming trend: The United States had the highest rate of women who die during pregnancy, childbirth and postpartum in the developed world despite spending more on healthcare than any other country.

As reporters dug deeper into the data, a vital element was missing. Where were the mothers?

“When a pregnant woman or a new mother dies in the U.S., we discovered she is almost invisible. Her identity is shrouded by medical institutions, regulators, and state maternal mortality review committees. Her loved ones mourn her loss in private. The lessons to be learned from her death are often lost as well,” reported Nina Martin, who led the project for ProPublica.

A screenshot of the "Lost Mothers" series by ProPublica and National Public Radio.

To fill the gap, the investigative team created a first-of-its-kind national database of women who died from pregnancy-related complications.

They combed social media and crowdfunding sites like GoFundMe for leads and turned to obituaries and Facebook to verify information and locate family and friends.

They published a request: “Do you know someone who died or nearly died in childbirth? Help us investigate.”

“We knew the statistics,” said Martin, “but we didn’t have the human stories.”

Nearly 5,000 responses came from all 50 states, Washington, D.C., and Puerto Rico. These personal accounts were the backbone of the prize-winning series that spanned 2017 to 2020.

Among the key findings:

  • “The U.S. has nearly double the number of maternal deaths per 100,000 live births compared to other wealthy, developed nations like France and Canada: roughly 100 percent more deaths per capita.”

  • “Black mothers in the U.S. die at three to four times the rate of white mothers, one of the widest of all racial disparities in women’s health.”

  • “According to the Center for Disease Control, more than 60 percent of pregnancy and childbirth-related deaths in the U.S. are preventable.”

Lost Mothers won the prestigious Goldsmith Award for Investigative Reporting, the George Polk Award for Medical Reporting and was a Pulitzer Prize finalist for explanatory journalism. The project has been widely credited with sparking change in America’s health care system.

In the wake of the project, the U.S. House of Representatives unanimously approved a bill to fund state committees to review and investigate deaths of expectant and new mothers, a major step toward addressing the shortage of reliable data on maternal mortality. A U.S. Senate committee proposed $50 million to prevent mothers from dying in childbirth.


Giving faces and voices to Lost Mothers was “an absolutely conscious choice and a necessity,” said Martin. “People start to yawn if they don’t understand the implications [of the data]. They melt away. We wanted people to ‘see’ the story and react.” She has since moved to Reveal, the publication of The Center for Investigative Reporting.

Scientific research supports Martin’s premise. Noted psychologist Paul Slovic uses the terms “compassion fade” and “psychic numbing” to explain how the brain responds to abstract numbers that have no human connection.

“If readers don’t relate to the information, they are less likely to act and use it,” said Slovic, a founder and president of Decision Research, a collection of scientists who study the human psyche. His advice: “Don’t just throw numbers at people. That’s the worst way to go about it.”

Breathing life into statistics

Slovic’s comments raise the question: how useful are datasets if they don’t resonate with the audience and make a difference? For instance, the United Nations reports millions of refugees are on the move. Charts, maps and trendlines document exoduses from places like Syria, Afghanistan, and South Sudan.

How often are data on mass migrations personalised in news reports? If displaced people were given a voice, a face and personality, would the international community pay more attention?

“People remember people. They don’t remember numbers. Quantitative information on its own means very little,” said Gurman Bhatia, a visualisation designer based in New Delhi.

During research on how the brain handles information, Bhatia stumbled on Slovic and his theory of compassion fade, the notion that people begin caring less when they are overwhelmed with data.


Slovic believes the best way to combat indifference is to tell individual human stories as reminders that behind every number is a real person. He expands on this theory in a co-authored paper on “psychic numbing”, the insensitivity to large numbers of victims.

Following is an excerpt:

“Large numbers have been found to lack meaning and to be underweighted in decisions unless they convey affect (feeling) . . . On the one hand, we respond strongly to aid a single individual in need. On the other hand, we often fail to prevent mass tragedies -- such as genocide -- or take appropriate measures to reduce potential losses from natural disasters. We believe this occurs, in part, because as numbers get larger and larger, we become insensitive; numbers fail to trigger the emotion or feeling necessary to motivate action.”

Case in point: When the body of two-year-old Syrian refugee Aylan Kurdi washed ashore in Turkey in September 2015, the image went viral, sparking an outpouring of aid for refugees and policy changes on migration.

The death toll in Syria numbered in the hundreds of thousands with scant international response. Suddenly, a tiny corpse face-down on a beach moved the public in ways statistics could not.

A screenshot of the Guardian article featuring images of Aylan Kurdi, the two-year-old Syrian refugee boy who drowned in September 2015 en route to Kos.

“Overnight, that picture woke up the world. People got emotionally connected to the problem,” said Slovic, a member of the National Academy of Sciences. “Generally, if there’s something people can do to help, they will do it. If they don’t feel they can make a difference, they get turned off. It is not enough just to break through the numbing.”

When the image of Aylan Kurdi surfaced, Sweden had around 160,000 Syrian refugees. The day after the photo appeared, donations to the Swedish Red Cross jumped from $8,000 to $430,000. A month later, they had gone back down.


To make a point, Slovic described an experiment where subjects were asked to think about an amount of money equivalent to $1, and to visualise that amount in American currency. They were told they could visualise 100 pennies, 10 dimes, four quarters, a silver dollar or dollar bill.

Overwhelmingly, participants visualised a dollar bill instead of multiples like quarters or dimes. A single object was easier to envision and connect with; it was more difficult to think about the many.

Slovic advises journalists to:

  • Convey a strong connection to people in their stories
  • Personalise events through the eyes of those experiencing it
  • Put themselves in the shoes of those who are suffering

“Remember, the first four letters in numbers are numb. That is what you want to avoid,” the psychologist said.

Data and narrative journalism

The concept of humanising data has become a driving force for journalists like Tricia Govindasamy, senior data product manager for Code for Africa and an expert in Geographic Information Systems.

To her, a spreadsheet is “just numbers until a human face is put to them.” Data scientists and analysts generate statistics, but journalists hold the real power by using numbers in their stories and giving them a voice, said Govindasamy.

She coaches reporters to take a new slant on something they already do well – interviewing. When faced with a dataset, why not question the numbers as if they were a person, suggests Govindasamy, a data literacy trainer.

Interviewing the data can flesh out story ideas, identify new angles and lead to appropriate human sources. If a dataset documents death and destruction caused by massive flooding, the reporter will study the numbers to determine the following:

  • Where did the worst flooding occur?
  • Which villages were destroyed?
  • What did the villagers do for a living?
  • Where are they now? How many children died?

With that information in hand, reporters head to the scene to talk with survivors and medical and rescue teams in search of intimate details. Data becomes a guide to the human condition, helping create change, accountability and impact that otherwise might be ignored.
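Those interview questions translate directly into queries. Below is a minimal sketch using pandas against an invented flood dataset; the file and column names are assumptions for illustration only.

```python
# Minimal sketch of "interviewing" a flood dataset with pandas
# (pip install pandas). File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("floods.csv")
# assumed columns: district, village, deaths, child_deaths, main_livelihood

# Where did the worst flooding occur?
worst = df.groupby("district")["deaths"].sum().sort_values(ascending=False)
print(worst.head(5))

# Which villages were hit hardest, and what did residents do for a living?
hardest = df.sort_values("deaths", ascending=False).head(10)
print(hardest[["village", "deaths", "main_livelihood"]])

# How many children died?
print("Child deaths:", df["child_deaths"].sum())
```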

Govindasamy cited “The Pandemic Poachers” as an example of how humanising data made a story stronger and more appealing. InfoNile interviewed local communities in conservation areas on how the pandemic affected them as lockdown restrictions stopped tourism and impacted daily life.

A screenshot from the InfoNile's "The Pandemic Poachers" article. The chart shows the total reported and estimated weight of trafficked wildlife linked to East Africa from 2010 to 2020.

Here is the extended lead from the story:

“In one of the world’s pristine wildlife wildernesses, selling beads has helped Mdua Kirokor keep her kids in school. Mdua Kirokor is a member of the Maasai pastoralist tribe living within the Maasai Mara, a world-renowned savannah in southern Kenya and northern Tanzania that is home to lions, leopards, elephants and a spectacular annual migration of wildebeest.

Since 2017, the income Kirokor has saved from beadwork has helped her pay school fees, buy livestock, and invest in home provisions including water tanks and a gas cooker. But while she intended to soon purchase land for leasing through a wildlife conservancy, her plans were halted after the pandemic ground tourism almost to zero.”


This brand of highly detailed storytelling may require new ways of thinking.

During workshops, Eva Constantaras, data journalism advisor for Internews, suggests data specialists break away from the tech team and get into the larger newsroom to work with reporters.

It doesn’t matter if these beat writers aren’t adept with technology or don’t know how to code. They know how to engage readers and put pressure on governments to change policies. They are experts in their fields, whether education, politics, migration or crime.

“Journalism has a history of being a lone wolf sport. Data journalism is at its best when it is collaborative. That is what I push for,” said Constantaras, who is based in Athens. Below are examples of projects that exemplify the relationship between numbers and human beings.

The New York Times' COVID-19 coverage

The New York Times won the 2021 Pulitzer Prize for public service for chronicling the toll of COVID-19 at home and abroad. It defined its coverage as "tough reporting, a huge data analysis effort and deep human storytelling." The project portrayed "A Nation of Mourners, Forever Scarred."

"We strove every day not to be so focused on the numbers that we forgot the people behind them," Marc Lacey, assistant managing editor, wrote in the Times. Following is an outline of how the Times did it:

A screenshot showing how The New York Times gathers and shares data. The Times’s database of COVID-19 cases and deaths was sourced from the websites of hundreds of state and county health authorities, using a combination of manual and automated tasks. Credit: Guilbert Gates/The New York Times

The following is an example of how the dead were memorialised in The New York Times:

“What Loss Looks Like”

  • Readers were asked to submit photographs of objects that reminded them of loved ones who died from coronavirus or other causes over the last year. The images and personal stories were published digitally as an interactive feature that became a virtual memorial.

A screenshot from the interactive piece from [The New York Times] on "What Loss Looks Like".

“Those We’ve Lost”

  • The Times obituaries editor solicited contributions from the newspaper’s bureaus in America and around the world. It informed readers, “This series is designed to put names and faces to the numbers.” Since March 2020, the series profiled more than 500 people; the project ended in June. An example from the list: “Yury Dokhoian, chess coach who guided Kasparov, dies at 56. A Russian grandmaster, he spent a decade working with the longtime world champion, who said Mr Dokhoian gave him ‘stability and confidence.’ He died of the coronavirus.” A personality profile of Dokhoian was attached.

“Wall of Grief”

  • Toward the end of May 2020, a visualisation on the front page marked 100,000 lost with the names of people who died from the virus, most within a three-month period, and memories of their lives from obituaries.

Some of the people featured in the visualised obituary include a police detective in Harlem with a gift for interrogation, Cedric Dixon, 48, New York City; transgender immigrant activist, Lorena Borjas, New York City; advocate for disability rights, April Dunn, 33, Baton Rouge, Louisiana.

In February 2021, another page one graphic began with a single dot and grew to 500,000 dots, each representing a life lost in the United States to the coronavirus. By then, COVID-19 had caused more American deaths than World War I, World War II and the Vietnam War combined.

According to the Times, a goal for the project was to show “A Nation of Mourners, Forever Scarred” through videos, photos, and personal stories, all reminders of the void COVID-19 left behind.

A screenshot of The New York Times graphic where each of the nearly 500,000 individual dots represents a life lost in the United States to the coronavirus. Talking about the death toll, Lauren Leatherby, a graphics editor on the project, said the visual reflects “the sheer speed at which it was all happening.”

Putting faces on crime statistics

Florida’s Palm Beach Post has become a model for how homicide tracking can work.

The website defines the project as “a means both of humanising and quantifying killings” and “presenting faces as well as facts.”

Most interactive entries include a photo of the deceased, the time and place of the killing, and a brief personal profile. Links at the bottom of the entry direct users to additional reporting on the case.

Maps show where each killing took place, down to the block. The Post’s online database has tracked every homicide in the county dating back to 2009.

A screenshot of Florida's Palm Beach Post interactive map showing where homicides took place.

An example from Homicides Tracker:

The body of Ryan Rogers, 14, was discovered near an interstate overpass along Central Boulevard in Palm Beach Gardens on Nov. 16, 2021. Investigators later ruled that the teen's death was a homicide. Authorities arrested a homeless man from Miami in connection to the murder. Do you have information to share about the life of Ryan Rogers? The Palm Beach Post needs your help. Email us at [email protected]

In September 2020, the Post published a review of 1,041 homicides in Palm Beach County as part of an investigation into the killings of Black males that disproportionately went unsolved when compared to females and males of other races.

Local law-enforcement agencies, medical examiner’s offices and death notices are the main sources for the numbers. Interviews with families and friends personalise coverage, giving victims a voice and personality.

In 2010, Homicide Watch D.C. began bringing murder victims to life through one of the first databases of its kind chronicling killings in the capital. Today, crime-tracking projects have become commonplace in the media.

The United Nations Global Study on Homicide provides journalists with a broad overview of murder rates, what regions of the world are the most lethal, and possible solutions. It is worth a read for reporters looking to build expertise.


Giving immigrants a voice

Mass migration has become major news throughout the world. In Syria, refugees flee bombs and torture. African migrants risk death to cross the Mediterranean to escape poverty and disease. America’s pullout from Afghanistan produced searing images of desperation and death.

How are these refugees being covered in today’s media world?

Shamim Malekmian quickly said yes when editors at the Dublin Inquirer asked her to create an immigration beat for the newspaper. From the beginning, she had a goal in mind: To bring a human perspective to coverage of refugees flowing into Ireland from places like Nigeria and South Africa.

“Immigrants are not just another statistic. Every person has a story that numbers alone can’t tell,” said Malekmian, who also writes for Hot Press, a Dublin-based music and politics magazine. Her specialities include covering the environment, climate and giving voice to the underdog.

In September, she wrote about dozens of migrant children who have gone missing while in the state’s care.

She reported on asylum-seekers in limbo during the long wait for interviews and racist attacks against people of colour in Dublin.

A screenshot showing a Dublin Inquirer article documenting the experiences of racist attacks against immigrants in Dublin.

Malekmian sees herself as a change agent, pushing the government to reform immigration policies she described as “archaic.” She found it “shocking” that so many immigrant children had gone missing without causing a stir among state watchdogs.

She quoted one child advocate as saying, “The full weight of the state should be brought to bear in trying to find a missing child.” Her reporting clearly showed this was not the case.

Her next move on the immigration beat? “I actually would like to locate some of these missing kids. I am focused on two girls missing from a hotel in Dublin,” said Malekmian. “That is where I will likely go next.”

She offered the following tips on how to cover an immigration beat:

  • Always go beyond press releases and the official government line
  • Let the data guide you to stories and new angles
  • Establish strong connections with the people you are covering. It’s the best way to gain their trust.
  • Stay in touch with sources and follow up on their stories
  • On-the-ground reporting is vital despite limitations placed on the media by refugee centres

Past and present

As evidenced by the journalists in this article, the push to humanise “numbing numbers” has taken hold in newsrooms across the globe.

Two years ago, a story in The Guardian reminded readers, “Over the past decade, our approach has evolved, and now we amplify the stories we find in the data by collaborating with specialist reporters to put human voices at the centre of our stories.”

The Guardian underscored a major point: “Behind every row in a database, there is a human story.” That message resonates with data journalists today.

Other resources that can help

  • Psychology Today: “How can we combat “compassion fade?” The article deals with making big numbers more meaningful and ends with a quote. “It is likely impossible to forget that people are dying [of Covid] every day. The challenge for us personally is to continue to care.”

  • Arithmetic of Compassion. Website established by Paul Slovic to raise awareness of psychological obstacles to compassion, including psychic numbing and pseudoinefficacy. Provides suggestions for how to combat cognitive biases and tackle global issues such as mass atrocities, famine, and climate change.

  • Gapminder: Nonprofit, nonpartisan foundation offering free data visualisation tools and teaching resources for using and analysing global population statistics. Tools include bubble charts, maps, and databases on poverty, population growth, employment, environmental trends, and health statistics worldwide.

  • “How Can We Tell Migrants’ Stories Better? Here are 10 ways.” Bright magazine provides a road map to improving coverage of migrants, including going where the story is, focusing on people, and going beyond stereotypes.

  • “Photo essay: Painstaking Portrait of Some of New York’s Darkest Days.” Queens was hit hard by the coronavirus. A group of Times journalists provided an intimate picture of some who died. An example of how words and pictures work together to tell a powerful story.

  • “Data Journalists’ Roundtable: Visualizing the Pandemic.“ Four data journalists covering COVID-19 describe their approach to producing graphics, including a checklist to consider before publishing.

Wrangling the robots: Leveraging smart data-driven software for newsmaking https://datajournalism.com/read/longreads/wrangling-the-robots-software Mon, 20 Dec 2021 06:30:00 +0100 Monika Sengul-Jones https://datajournalism.com/read/longreads/wrangling-the-robots-software Imagine you’re a business reporter with a couple of hours to file a 500-word article about falling oil prices and the impact on multiple regions. The words are there, but you’re spending a lot of time pulling data. Too much, in fact, to do an interview.

Imagine you’re a local reporter for a regional news outlet tasked with covering multiple city council meetings that are happening on the same evening. How can you possibly be in three places at once?

Imagine you’re on the politics desk of a national newspaper and notice a peculiar phrase used by a politician at a press conference. Is it a dog whistle or a typo? How do you verify it?

In each of these scenarios, existing data-driven tools and software can help alleviate the pressure points for journalists in their day-to-day reporting.


While investigative journalists are doing innovative reporting with machine learning, uptake has been uneven in journalism more generally, reported Charlie Beckett, professor of practice and director of the JournalismAI project at the London School of Economics. Many of the services available to newsrooms from third-party vendors are built on statistical modelling, sometimes called machine learning or “AI-powered” tools.

In 2019, Beckett led a survey of 74 journalism organisations from around the world and found most newsrooms trialling AI are in the northern hemisphere. Reasons for the slow onboarding are varied, but many are fearful of job loss or deskilling, while smaller newsrooms face challenges related to skill and resources.

While such fears are not without precedent, there’s optimism. In a podcast for Conversations with Data in July 2020, Beckett said, “If you want more time to get on with your journalism, then you better be in a newsroom where the boring bits are being looked after by a machine.”

How to leverage AI-powered services

Find out who is sharing what

For journalists concerned with what is being shared online, the biggest players in Big Tech don’t make social analytics easy to come by. The main service available, CrowdTangle, is owned by Meta, the company formerly known as Facebook. Familiar to many audience development editors and social media teams, the tool can also help journalists identify stories or stamp out misinformation.

Using CrowdTangle, journalists can identify engagement with content, including which accounts have shared a specific URL or keyword across social media (Facebook, Twitter, Reddit, and Instagram). Also available to the public is a hub of Live Displays, which offer a real-time view of trends on social media.

CrowdTangle is a service from Meta that provides analytics about Instagram, Facebook, and Reddit. Above is a screenshot showing the tool in action.

Reporters and newsrooms can gain special access to more features from CrowdTangle by filing an access request. To get up to speed, journalists can check out a training programme on using CrowdTangle for research from First Draft, a nonprofit that aims to protect communities from misinformation. Given how much a journalist's reporting relies on social media, such a tool is essential for verification and delivering fact-based reporting at speed.
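For teams with API access, the same lookup can be scripted. Below is a minimal sketch of querying CrowdTangle's Links endpoint for the public accounts that shared a given URL. The endpoint path, parameter names, and response fields are assumptions based on CrowdTangle's public API documentation of the time, and the token placeholder must be replaced with one issued to an approved account.

import requests

API_TOKEN = "YOUR_CROWDTANGLE_TOKEN"  # issued after an approved access request

def who_shared(url, count=20):
    # Hypothetical sketch: the /links endpoint and the "link", "count",
    # and "token" parameters are assumptions drawn from CrowdTangle's
    # public API documentation of the time.
    response = requests.get(
        "https://api.crowdtangle.com/links",
        params={"link": url, "count": count, "token": API_TOKEN},
        timeout=30,
    )
    response.raise_for_status()
    posts = response.json().get("result", {}).get("posts", [])
    # Keep just the account, platform, and engagement stats per share.
    return [
        {
            "account": post.get("account", {}).get("name"),
            "platform": post.get("platform"),
            "stats": post.get("statistics", {}).get("actual", {}),
        }
        for post in posts
    ]

for share in who_shared("https://example.com/story"):
    print(share)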

The takeaway? Use it! It's free, and the barriers to entry are low. But like any data-driven service, the product won't give all the answers. CrowdTangle's metrics were decided by the parent company Meta. The Markup, an investigative newsroom, recently pulled back the curtain on CrowdTangle in "How We Built A Facebook Inspector".

They show that Meta does not provide impressions data -- the number of times content is shown to users -- limiting what journalists can say about the information. In addition, keep in mind that reporting on certain keywords or trends can inadvertently amplify them. Ethically speaking, this amplification could be problematic, as it may unnecessarily spread uncertainty or fear to audiences.

Transcribe audio

There are a range of free and paid transcription services newsrooms can use to report on live events such as city hall meetings, especially if they are online. Otter.ai uses artificial intelligence to provide real-time transcription and notes in easy-to-use file formats and direct language. Trint offers similar services. Reporters can upload a recorded audio file from Zoom or another video conferencing service, and the tools transcribe the file into written text. Reporters can listen to the audio file and edit the transcription within the document for accuracy. Files can be shared with team members or exported.

Above is a screenshot showing the Trint display for editing an audio file.

According to Aimee Rinehart, the AI program manager at Associated Press, these services may be helpful in places with a dearth of local media.

Rinehart referenced a conversation with Laura Frank, Executive Director of CoLab, about what small newsrooms are experiencing in southwest Colorado, where one news reporter is responsible for covering six counties.

“That’s a tremendous task,” said Rinehart. “AI can help. One powerful thing that AI can do is write up and distribute summaries of city council meetings. For instance, provide a transcript, a summary.”

“You look at things like voting habits, and they happen less in news deserts. The public needs information about civic life,” said Rinehart.

The takeaway: Transcription services are already a part of the workplace at many news media organisations. They save journalists time transcribing interviews, podcasts, or live meeting notes. In the case of “news deserts,” transcription bots could feasibly support newsrooms in doing the impossible: sending a reporter to three places at once.

Costs range from free to enterprise rates in the hundreds of US dollars. While versatile and already part of many newsrooms, the transcripts may require human copyediting to be publishable.

Above is a screenshot of Otter.ai taking live meeting notes via Zoom.

Services and pricing: Otter.ai provides real-time transcription and note-taking in English. Pricing includes a free plan for live transcription; paid plans, which add transcribing uploaded files and using Otter.ai with third-party applications such as Zoom, start at $20 USD per month.

Trint provides transcription services in 54 languages on mobile and browsers. There’s no free version, but Trint does offer a 7-day free trial. Pricing plans for one user are $48 USD per month, while plans for up to 50 users are $68 USD per month.

Automate parts of newsmaking

When freelance journalist Samantha Sunne worked at Reuters in the mid-2010s, she was tasked with writing oil industry reports. She spent a good part of her day tracking prices and preparing reports for a subset of subscribers involved in the oil industry.

“’Oil went up by a half-cent, Oil went down by a cent,’ that kind of thing,” said Sunne. “If I have a robot making the lead for the oil price change reports and a script inserting the data, then I have time to do an interview.” Fast forward a decade, and according to Beckett, many larger newsrooms have smart automation services in place that write copy on topics such as pricing or sports.

For instance, The Associated Press has partnered with Automated Insights, a natural language generation platform, to automate earnings reports and sports articles.

Investments in these services can relieve journalists from tedious and time-consuming work.

The AP is also developing services in-house. Currently available is AP VoteCast, which provides localised coverage of election day with automated stories and graphics.

The AP is currently at work on more scripting services, said Troy Thibodeaux, AP's director of digital news, in an email. The aim is to make it easier for partner newsrooms to auto-generate copy for specific states using up-to-date datasets.

AI-powered software can be a helpful research tool. This is the hope of former journalist and author Maria Amelie, CEO and co-founder of Factiverse, whose AI-powered browser add-on (try the prototype) fact-checks claims for journalists. Use the search bar to check specific claims or create a profile to fact-check an entire article. The source data is culled from the International Fact-Checking Network (IFCN). Amelie's experience in journalism motivated her to develop the product.

"Journalists are spending hours on research to avoid making mistakes. There's a great deal of pressure on newsrooms to deliver many articles and to be fast. But there's so much content that they need to review -- millions and millions -- so our task is to make some of that easier," said Amelie.

The takeaway: In general, automation services from third-party vendors cost money, and prices vary depending on the services. But investments can relieve journalists from tedious and time-consuming work. "On one hand, people don't want to get automated out of a job, but there's a place where drudgery can be relieved by AI," said Sunne.

AP leverages machine-learning, among other techniques, to automate localised coverage of elections for newsrooms with AP VoteCast.

Keeping robot writers at bay

AI-powered robot writers stoke both fear and excitement. In 2012, Wired reported a prediction by Kristian Hammond, chief scientist at Narrative Science, an AI and natural language processing software developer, that 90% of news would be written by robots by 2026. A decade on, does automated writing have a place in newsrooms?

Abhishek Prasad, Head of Insights & AI and Automation at the HT Media Group, a mass media company based in Delhi, India, has experience trying to incorporate robot writers into newsmaking: “Believe me, we tried it, and the technology is absolutely not there.”

I piloted two robot writers to better understand the tools, which promised to turn a blank page into copy ready to edit. Given I was working on this article, the proposal was appealing, and I popped these keywords into the online AI writer Rytr.me, which costs $29 USD per month and offers a free trial:

Role of AI in the future of journalism. Artificial intelligence, bias, public good, moral compass, practical applications, diversity, clarity

Out came copy -- not unconvincing, but inaccurate:

“AI can take away all the bias from an article by detecting the correct keywords and synonyms. They also have a good moral compass to know what is in the public’s interest and what not to report. It is important to note that AI writers do not have a sense of bias, which is something that has been a concern when it comes to journalism in the past. With AI writers, there is no need to worry about bias in any shape or form.”

Above is a screenshot of the Rytr.me settings and copy from fall 2021 on the topic of “The Role of AI and the Future of Journalism.” Rytr is available in thirty languages and can write content for a variety of settings, such as blogs, emails, and social media posts.

It turns out this robot isn’t a terrible writer, but a terrible reporter. Most of the claims above are inaccurate. But the copy could feasibly fool a distracted reader. And it doesn’t take an ethicist to say that’s problematic.

Next, I tried Copy.ai using the same keywords. This company was recently awarded $11 million USD in Series A venture capital. The robot writer costs $35 USD/month to use and offers a one-week free trial.

Pictured above is a screenshot of the Copy.ai settings and copy on the topic of “How to use AI for journalism.” The software offers several settings to adjust and is available in 25 languages.

The software pushed out several paragraphs of copy, including a section referencing Microsoft and Associated Press, but something seemed off. There was no context.

I stopped there to do an internet search, which led me to a 2017 Poynter piece by Sunne.

Above is a screenshot of copy written by Copy.ai.

Her story was summarised in a line, without references. Where did it come from? Who can use this writing?

I sent an email to Chris Lu, CEO of Copy.ai, asking about the training data and the rights to re-use. His reply was enthusiastic:

"All of our content is AI-generated! You don't need to credit Copy.ai at all, and it's your work, especially if you make edits after generating it with Copy.ai!" He confirmed by email that their robots are trained using about 10% of the internet, including Wikipedia articles.

"Those robots are straight up stealing," Sunne said jokingly in an email when I contacted her. Thieves, perhaps, but sloppy ones. It wasn't any December, but December 2017; AP didn't "join forces" with Microsoft but used a Microsoft application as a part of a "non-financial collaboration." The pilot is also over.

The takeaway: Robot writers can stoke the imagination with copy, but results can be sloppy and require additional research to provide necessary nuance and fact-checking.

Keep in mind that robot writers could be misused to amplify the production and circulation of false or misleading information online, with consequences for journalism and society.

Push for robot-generated copy to be labelled, and for AI to be transparent in general, as the Council of Europe demands in a June 2021 declaration.

As long as robots remain one tool in the toolbox -- with humans responsible for the ethics statements that will guide their adoption -- they can, as Beckett hopes, allow journalists to do more creative work. This includes using machine learning to hold governments and corporations to account.

Will this headline make you mad?

Statistical modelling has been used for more than a decade to predict which headlines will garner clicks, first using the logic of search engine optimisation and, more recently, using emotional resonance to predict engagement on social media platforms. Plugging headlines into an analyser is a low-stakes way to gauge the grab of a headline for a general audience.

Advanced Marketing Institute’s Headline Analyzer, for example, gives a headline a percentage score from 1 to 100 for its “emotional marketing value” but offers no feedback on what is lacking, while the Headline Analyzer Tool evaluates headlines on SEO, readability, and sentiment.
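Under the hood, the simplest of these analysers do something conceptually modest: score a headline against curated word lists and weight the result. The toy function below sketches that idea only; it is not the method of either tool named above, and its word lists are invented for illustration.

# Toy headline scorer -- a sketch of the idea behind emotional-value
# analysers, not the actual method of any commercial tool.
# The word lists below are invented for illustration.
EMOTIONAL_WORDS = {"shocking", "secret", "outrage", "heartbreaking", "mad"}
POWER_WORDS = {"you", "your", "how", "why", "now"}

def emotional_score(headline):
    words = [w.strip(".,!?\"'").lower() for w in headline.split()]
    if not words:
        return 0.0
    hits = sum(w in EMOTIONAL_WORDS or w in POWER_WORDS for w in words)
    # Score = share of words that appear in either list, as a percentage.
    return round(100 * hits / len(words), 1)

print(emotional_score("Will this headline make you mad?"))  # prints 33.3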

The takeaway: Newsrooms interested in using AI but unsure of the next steps should first consider the specific problems they are trying to solve. The answer can guide what comes next -- whether that’s hiring someone or using AI-powered software. Headline analyzers are an example of a tool that might be helpful, depending on the specific needs of the newsroom.

From robots to infrastructure

Artificial intelligence, when defined as an autonomous, sentient system, remains unlikely for the foreseeable future. But automation using machine learning is already a part of daily life. Investigative journalists are using machine learning to hold companies and governmental organisations accountable.

The tools and services described in this article are already inside newsroom editorial processes. What the future holds is perhaps less about whether robots will be a part of journalism than about how definitions of what is abnormal -- or magical -- will shift as norms change.

In the closing session of the AI Journalism Festival on Dec. 1, 2021, Agnes Stenbom, Responsible Data & AI Specialist at Schibsted Group, said discussions about applications of AI will soon be obsolete. Just as electricity seemed magical when it was first adopted and is now a banal part of everyday life, so goes AI.

“So we need to invest in the groundwork, to lay a solid foundation,” Stenbom said. “It will be central for news organisations to prepare their people -- not just their tech people -- for the change toward AI as infrastructure.”

Just as electricity seemed magical when it was first adopted and is now a banal part of everyday life, so goes AI.

In other words, in a decade, there won’t be articles bemoaning how the robot writers have stolen reporting jobs. But that won’t be because machine learning and automation aren’t involved in newsmaking. They will be.

But as automation forms infrastructure, the work of the journalist will shift as well. In the meantime, Stenbom suggests newsrooms “vocalise our values and explicitly state and plan for the direction we want to go.”

Sabin Muzaffar, the Founder and Executive Editor of Ananke Magazine, a digital platform for women across the MENA & Sub-continent regions, has only recently begun incorporating AI into newsmaking and production, as shared in a panel on AI and small newsrooms at the AI Journalism Festival.

“As far as my experience is concerned, everyone in the newsroom needs to have ownership [over the use of AI]. Only then can you move positively forward. And you learn every step of the way,” she said.

Thanks to Tadayoshi Kohno, Professor in the University of Washington Department of Computer Science & Engineering, for being interviewed for background research for this story.

Monika Sengul-Jones, PhD, is a writer in Seattle, Washington, USA. @monikajones, www.monikasjones.com

The history of data journalism https://datajournalism.com/read/longreads/the-history-of-data-journalism Mon, 13 Dec 2021 06:30:00 +0100 Brant Houston https://datajournalism.com/read/longreads/the-history-of-data-journalism It all started with trying to predict the outcome of a US presidential election.

Many practitioners date the beginning of computer-assisted reporting and data journalism to 1952 when the CBS network in the United States tried to use experts with a mainframe computer to predict the outcome of the presidential election.

That’s a bit of a stretch, or perhaps it was a false beginning because they never used the data for the story. It really wasn’t until 1967 that data analysis started to catch on.

In that year, Philip Meyer at The Detroit Free Press used a mainframe computer (known as big iron) to analyse a survey of Detroit residents for the purpose of understanding and explaining the serious riots that erupted in the city that summer. Decades later, The Guardian in the United Kingdom used some of the same approaches to look at racial riots there and cited Meyer’s work.

Meyer went on to work in the 1970s with Philadelphia Inquirer reporters Donald Barlett and James Steele to analyse sentencing patterns in the local court system, and with Rich Morin at The Miami Herald to analyse property assessment records.

Meyer also wrote a book called Precision Journalism that explained and advocated using database analysis and social research methods in reporting. Revisions of the book, now called New Precision Journalism, have been published since then.


Still, only a few journalists used these techniques until the mid-1980s, when Elliot Jaspin in the U.S. received recognition at The Providence Journal Bulletin for analysing databases for stories, including those on dangerous school bus drivers and a political scandal involving home loans.

Jaspin, who had won a Pulitzer Prize for traditional reporting on labour union corruption, also had taken a fellowship at Columbia University to learn how to use data. This was the same university where journalist and professor Steve Ross had been teaching data analysis techniques for years. By the late 1980s, about 50 other journalists across the U.S., often consulting with Meyer, Jaspin, or Steve Doig of the Miami Herald, had begun using data analysis for their stories.

The use of data by journalists has vastly expanded since 2015.

Aiding the efforts of the data journalists of the 1980s were improved personal computers and much-needed software -- Nine Track Express -- that Jaspin and journalist-programmer Daniel Woods wrote to make it easier to transfer computer tapes (which contained nine “tracks” of data) to personal computers using a portable tape drive.

This was a remarkable breakthrough because it allowed journalists to circumvent the internal bureaucracies and delays involved in using mainframes at newspapers and universities and instead do their work at their desks.

In 1989, U.S. journalism recognised the value of computer-assisted reporting when The Atlanta Journal-Constitution won a Pulitzer for stories on racial disparities in home loans. The project was one of the first data-story collaborations involving an investigative reporter, a data reporter, and college professors.

During the same year, Jaspin established at the Missouri School of Journalism what is now known as the National Institute for Computer-Assisted Reporting (NICAR). Then, in 1990, Indiana University professor James Brown held the first computer-assisted reporting conference in Indianapolis, Indiana, and continued to organise it for several years.


From the 1990s through the early 21st century, the use of computer-assisted reporting blossomed, primarily due to the seminars conducted at Missouri and worldwide by Investigative Reporters and Editors (IRE) and NICAR.

IRE held its first computer-assisted reporting conference in 1993, and after that, the conferences were a joint project of IRE and NICAR. The growth of computer-assisted reporting was also aided by the 1996 publication of my book, "Computer-Assisted Reporting: A Practical Guide" -- the first on doing CAR, now in its fifth edition.

I wrote the book so that it could be used as a textbook for university classes, but also for the lone and lonely practitioners in newsrooms that recognised the power of data but thought having a “nerd” in the corner of the newsroom sufficed for what was an ongoing revolution in journalism.


After NICAR was created in 1994, training director Jennifer LaFleur and I initiated an ambitious on-the-road programme that eventually included up to 50 seminars a year with the help of colleagues across the country who volunteered their expertise and their time.

The creation of the on-the-road training was bolstered by the advent of the World Wide Web, which helped journalists immensely in their understanding of, and comfort with, the digital world and data. By 1996 word of the U.S. successes had reached other countries, and foreign journalists began attending the “boot camps” (intense, week-long seminars) at NICAR.

In addition, IRE, with the support of the McCormick Foundation, had set up a programme in Mexico City that provided data training in Latin America. It was led by the programme’s director, Lise Olsen, who travelled and trained throughout South America.

Going global

While journalists outside the U.S. at first doubted they could obtain data in their own countries in the 1990s, the training showed them how international or U.S. data could be used initially for stories in their countries, how they could build their own datasets, and how they could find data collected and stored by their governments.

As a result of the extensive training efforts, journalists had produced stories by 1999 involving data analysis in an array of countries, including Finland, Sweden, New Zealand, Venezuela, Argentina, the Netherlands, Norway, Brazil, Mexico, Russia, Bosnia, and Canada.

Meanwhile, in London in 1997, journalism professor Milverton Wallace began holding an annual conference called NetMedia that offered sessions on the Internet and classes in computer-assisted reporting led by NICAR and Danish journalists.

The classes covered the basic uses of the Internet, spreadsheets, and database managers, and they were well-attended by journalists from the UK, other European countries, and Africa.


In Denmark, journalists Nils Mulvad and Flemming Svith, who had gone to a NICAR boot camp in Missouri in 1996, organised seminars with NICAR in 1997 and 1998 in Denmark.

They also wrote a Danish handbook on computer-assisted reporting and, in 1998, created the Danish International Center for Analytical Reporting (DICAR), with Tommy Kaas as president. This led them to co-organise the first Global Investigative Journalism Conference with IRE in 2001.

CAR also became a staple of conferences in Sweden, Norway, Finland, and the Netherlands, with journalists such as Helena Bengtsson from Sweden and John Bones from Norway leading the way.

In Brazil, the investigative journalism association Abraji formed in 2002 with training in data journalism as part of its core mission. Two key leaders in Abraji's data journalism training in Brazil were Jose Roberto de Toledo and Marcelo Soares.

Data journalism comes of age

The early years of the 21st century also saw the Global Investigative Journalism Network begin to play a crucial part in the movement, starting with its first conference in 2001 in Copenhagen that offered a strong computer-assisted reporting track and hands-on training in conjunction with sessions on traditional investigative reporting.

Through the global investigative conferences, the use of data quickly spread across Eastern Europe, where Drew Sullivan, one of the original NICAR trainers and data administrators, formed the Organized Crime and Corruption Reporting Project, which has become a leader in data journalism.

By 2009, the increasing number of computer programmers and coders in journalism resulted in creation of Hacks/Hackers.

He and Romanian journalist Paul Radu were strong proponents and organisers of data training sessions and projects. Seminars also were given initially in China through the University of Missouri and in India through the World Press Institute, led by John Ullmann, who had been IRE’s first full-time executive director.

Ullmann also oversaw training in Latin America, recruiting me and other NICAR trainers to assist him.

During the same period Doig, a pioneer in CAR and later the Knight Chair in Computer-Assisted Reporting at Arizona State University, travelled internationally to teach CAR, as did additional NICAR training directors -- Sarah Cohen, Andy Lehren, Jo Craven McGinty, Tom McGinty, Ron Nixon, and Neil Reisner -- all practising journalists who went on to work at The New York Times, The Wall Street Journal, The Washington Post, and the Miami Herald.

Visualisation of data increases

Visualisation of data in charts and maps had been on the rise for some time, inspired by a map Doig created in 1992 for the Miami Herald. Demonstrating the deep value of data visualisation for analysis, Doig mapped hurricane wind speeds and building damage in the Miami area after Hurricane Andrew.

The map revealed a pattern of severe property damage where wind speeds had been low. Following up on that revelation, reporters found that shoddy construction and sloppy building inspections had led to the damage.

In 2005, the visualisation of data for news stories got another boost when U.S. programmer Adrian Holovaty created a Google mash-up of Chicago crime data. The project spurred more interest in journalism among computer programmers and in mapping.

Holovaty and his team of coders then created the now-defunct EveryBlock in 2007, which used more local data for online maps in the U.S., but the project later ran into criticism for not checking the accuracy of government data more thoroughly.

In 2007, the open data movement in the U.S. began in earnest, spawning other such efforts worldwide. The movement increased accessibility to government data internationally, although the need remained to have freedom of information laws to get data not released by the governments.

The use of data by journalists has now become so prevalent it is easier to keep track of the progress.

By 2009, the increasing number of computer programmers and coders in journalism resulted in creation of Hacks/Hackers, which would encourage more sharing between journalists and coders and ease some of the culture clash between the two groups.

Aron Pilhofer, then of The New York Times and now at Temple University, and Rich Gordon from Northwestern University’s Medill School of Journalism, had pushed for creation of “a network of people interested in Web/digital application development and technology innovation supporting the mission and goals of journalism.”

At the same time in Silicon Valley, Burt Herman brought journalists and technologists together. The three then joined to create “Hacks/Hackers.” The result has been an increasing technological sophistication within newsrooms that has increased the ability to scrape data from Web sites and make it more manageable, visual, and interactive.

Another outcome of the journalist-programmer mashup was the new respect among coders for knowing how flawed databases are, and for ensuring the integrity of the data.

As Marcos Vanetta, a Mozilla OpenNews fellow who worked at The Texas Tribune, put it well: “Bugs are not optional… In software, we are used to making mistakes and correcting them later. We can always fix that later and in the worst case, we have a backup. In news, you can’t make mistakes -- there is a reputation to take care of. The editorial team is not as used to failure as developers are.”


More breakthroughs

The years 2009, 2010, and 2011 also were breakthrough years for using data for journalism. In 2009, Fred Vallance-Jones and David McKie published “Computer-Assisted Reporting: A Comprehensive Primer,” with a special emphasis on CAR in Canada.

This was also the year that journalist Simon Rogers launched The Guardian's data blog.

The European Journalism Centre began its data-driven journalism programme that has organised workshops throughout Europe. This led to the establishment of DataJournalism.com for online training courses and other resources.

Journalist Paul Bradshaw became recognised as a pioneer in data journalism in the United Kingdom. In 2010, Wikileaks released its "Afghan War Diaries," composed of secret documents, and then the Iraq War Diaries, requiring journalists throughout the world to deal with enormous amounts of data in text form.


This was followed in 2011 by The Guardian’s impressive series using data and social media to analyse city racial riots in the United Kingdom. Journalist and author Brigitte Alfter then founded the first Dataharvest conference, which is now led by the Arena for Journalism in Europe.

The same year, work began in London on the first Data Journalism Handbook (now in a second edition and available in several languages), written by a consortium of contributors from around the world.

Also in the United Kingdom, the Centre for Investigative Journalism, led by Gavin MacFadyen, teamed up in its early days with IRE to offer classes in data journalism during its summer school, and later ran a strong programme on its own with the assistance of CAR veteran trainer David Donald.

Data journalism in the global south

Meanwhile, at Wits University in South Africa, Anton Harber and Margaret Renn substantially increased the data sessions at the annual Power Reporting Conference, now the African Investigative Journalism Conference.

Code For Africa founder Justin Arenstein and his team also paved the way for data journalists on the continent. In 2012, he launched Africa's largest data journalism and civic tech lab, covering stories involving environment and climate change, women and gender, and health and science.

In Asia, journalists in countries including India, Malaysia, the Philippines, and South Korea began using digital tools, especially those for visualisation, to produce high-impact data stories, and exchanged techniques and story ideas at GIJN’s biennial Asia conferences.


Journalists also began incorporating social media into their investigations more often. One striking story, using social network analysis, was done by journalists in South Korea, who uncovered an attempt by intelligence officers to undermine elections through social media propaganda. Inspiring data-led interactive pieces also came out of publications like South China Morning Post.

In the Middle East, Egyptian data journalist Amr Eleraqi set up the Infotimes in 2012, followed by the Arab Data Journalists' Network five years later. He began teaching the first Arabic-led data journalism training programme in the region. Meanwhile, IJNET and Arab Reporters for Investigative Journalism (ARIJ) have continued to offer training opportunities to Arab journalists in recent years.

In Latin America, Giannina Segnini, now at Columbia University, led a team of journalists and computer engineers at La Nación in Costa Rica to produce stories by gathering, analysing, and visualising public databases.

Meanwhile in Brazil, Natalia Mazotte from Open Knowledge Brasil launched Escola de Dados (the Brazilian chapter of School of Data) in 2012 to train journalists through its data literacy programme. By 2015, Abraji had created online courses in data journalism. A year later, Brazil's Coda Festival (Coda.Br) launched and grew to become the largest data journalism conference in Latin America.

Across the global south, data journalist Eva Constantaras began to develop training curricula for investigative and data journalism in high-risk environments with limited data access on behalf of Internews. Journalists have benefited from this data journalism training in a range of countries, including Afghanistan, India, Kenya, Kyrgyzstan and Myanmar.

By 2020, the COVID-19 pandemic revealed the breadth and depth of the skills journalists had accumulated.


A revolution in journalism

The use of data by journalists and in the digital tools journalists use has vastly expanded since 2015. Journalists have probed deeper into the analysis of unstructured data -- text, video, and sound -- and woven those media into compelling investigative stories.

They have routinely managed gigabytes of data for stories, organised massive data leaks with agility, and become more sophisticated in visualising data -- through maps, social network analysis, and change over time -- in both newsgathering and presentation.

They have conducted both traditional and innovative surveys to collect data to uncover social injustice. And the education and training -- and the syllabi and curricula at universities -- have become more focused and rigorous, thus producing new generations of data-savvy journalists.

The result has been an ever-growing stream of data-driven stories by small and large newsrooms -- often in collaborations -- that provide not only context and depth to stories, but also real facts, tips, surprises and epiphanies for journalists and their audiences.

By 2020, the COVID-19 pandemic revealed the breadth and depth of the skills journalists had accumulated. Throughout the world, journalists collected, analysed, and visualised pandemic data on a daily basis, often far exceeding what public health officials offered and, in fact, exposing the shortcomings of the data on which policy and practice were being decided.

The use of data by journalists has now become so prevalent it is easier to keep track of the progress and new directions by major categories. Among them:

  • Collaborations of journalists, sometimes with universities, using huge datasets, including leaks of data. Collaboration has become nearly a standard practice; the high-profile International Consortium of Investigative Journalists (ICIJ), the Big Local News project in the U.S., and Connectas in Latin America are just a few examples of ongoing collaborations.

  • The achievements made possible by free software that breaks down income barriers to entry into the field. The software includes a variety of tools to scrape, analyse, and/or visualise data, including Google tools, Tableau, Datawrapper, and many others.

  • The exploration of Artificial Intelligence and Machine Learning to discern patterns and outliers for further reporting, such as Reuters’ Lynx Insight project, an automation tool designed to help reporters accelerate the production of their existing stories or spot new ones.

  • The melding and analysis of satellite imagery, open-source video and photographs, social media, crowdsourcing and data for what are sometimes called visual or forensic investigations that are done by groups like Bellingcat.

  • Surveys through emails or mobile phones, such as the "Forced Out” project, which used a mobile phone survey to create a database from interviews with thousands of displaced people across South Sudan.

These advances, however, are not replacing but augmenting the original uses of computers and data for journalism that began by applying social science methods and statistical and data analysis to government and business corruption, health and environmental stories, and societal issues.

The use of data has broadened over the years: from counting instances of incidents and accidents in spreadsheets, to using database managers to match apparently unrelated datasets, to mapping data geographically and in social networks, to web scraping, more efficient data cleaning, better surveys, crowdsourcing and audience interaction, and text mining with algorithms.

But all of the work is still in the service of finding patterns, trends and outliers that lead to new knowledge and better news stories in the public interest. Over the decades, there also has been much discussion on what to call the use of data for high-quality journalism and various branding efforts to label it.

But whether it is called “precision journalism,” “computer-assisted reporting,” “data journalism,” “data-driven journalism,” or “computational journalism,” the good news is that it is not only here to stay, but will continue to become more critical to revealing truths, holding the powerful accountable, and protecting those who otherwise would be exploited.

Further reading

Professor Brant Houston holds the Knight Foundation Chair in Investigative and Enterprise Reporting at the University of Illinois, where he teaches investigative and data journalism in the Department of Journalism. He also is editor of the online newsroom at Illinois, CU-CitizenAccess.org, which serves as a lab for digital innovation and data journalism.

Bring in the machines: AI-powered investigative journalism https://datajournalism.com/read/longreads/machine-learning-investigative-journalism Wed, 01 Dec 2021 07:00:00 +0100 Monika Sengul-Jones https://datajournalism.com/read/longreads/machine-learning-investigative-journalism Oodles. Troves. Tsunamis. With data increasingly stored in extraordinary volume, investigative journalists have been piloting extraordinary analysis techniques to make sense of these enormous datasets -- and, in doing so, holding corporations and governments accountable.

They’ve been doing this with machine learning, which is a subset of artificial intelligence that deepens data-driven reporting. It’s a technique that’s not just useful in an age of big data--but a must.

The unwritten rule about when to use machine learning in reporting is pretty simple. When the humans involved cannot reasonably analyse data themselves--we’re talking hundreds of thousands of lines on a spreadsheet--it’s time to bring in the machines.

Reporters, editors, software engineers, academics working together--that’s where the magic happens.

What is machine learning?

For journalists just getting started, it might be comforting to know that machine learning shares many similarities with statistics. It’s also worth noting that the semantics are a point of contention.

“Reasonable people will disagree on what to call what we’re doing now,” said Clayton Aldern, senior data reporter at Grist, who recently co-reported the award-winning series Waves of Abandonment, which used machine learning to identify misclassified oil wells in Texas and New Mexico.

Indeed, a running joke is that “AI sells” -- another data journalist pointed me to this image to make that point.

The sentiment isn’t unfounded. Meredith Broussard, professor, journalist and author of Artificial Unintelligence: How Computers Misunderstand the World, said in an interview with the Los Angeles Times that “AI” took hold as a catchy name for what was otherwise known as structured machine learning or statistical modelling, in order to expand commercial interest. But there are differences.

“For one, we’re not using pen and paper,” said Aldern, who has master’s degrees in neuroscience and public policy from the University of Oxford. “We have the computational power to put statistical theories to work.”

Meredith Broussard is a professor and journalist who authored Artificial Unintelligence: How Computers Misunderstand the World.

That distinction is crucial, argues Meredith Whittaker, the Minderoo Research Professor at New York University and co-founder and director of the AI Now Institute.

Supervised machine learning has become “shockingly effective” at predictive pattern recognition when trained using significant computational power and massive amounts of quality, human-labelled data. “But it is not the algorithm that was a breakthrough: it was what the algorithm could do when matched with large-scale data and computational resources,” Whittaker said.

The AI Now Institute at New York University aims to produce interdisciplinary research and public engagement to help ensure that AI systems are accountable to the communities and contexts in which they’re applied.

Scaling hardly means that humans aren’t involved. On the contrary, the effectiveness of machine learning in general, and for journalism, depends not only on access to quality, labelled data and computational resources, but the skills and infrastructural capacities of the people bringing these pieces together. In other words, newsrooms leveraging machine learning for reporting have journalists in the loop every step of the way.

“[Machine learning] has a big human component […] it isn’t magic, it takes considerable time and resources,” said Emilia Díaz-Struck, research editor at International Consortium of Investigative Journalists (ICIJ), which has used machine learning in investigations for more than five years. “Reporters, editors, software engineers, academics working together--that’s where the magic happens."

When is machine learning the right tool for the story?

Designing and running a machine learning programme is a big task, and there are numerous free or reasonably priced training programmes available for journalists and newsrooms to sharpen their skill sets; we describe the process and training options at the end of this article. But how does machine learning fit into the reporting process? Here are a few of the ways.

Managing overload: Clustering to find leads

When the International Consortium of Investigative Journalists, a nonprofit newsroom and network of journalists centred in Washington, D.C., obtained the files that would make up the Pandora Papers -- like the other exposés it had reported, including the Panama Papers and Paradise Papers -- the sheer amount of information was initially mind-blowing.

“Reporters were overwhelmed,” said Díaz-Struck. Before they could tell stories, they needed to know what was there, and what they didn’t need. To accomplish this, the ICIJ reporters used machine learning to sort and cluster, among other methods. “First, it worked like a spam filter,” said Díaz-Struck, referencing a popular machine learning application, which sometimes uses Bayes’ theorem to determine the probability that an email is either spam or not spam. The task sounds simple but wasn’t easy.

ICIJ used machine learning to help conduct the largest investigation in journalism history, the Pandora Papers.

“[Miguel Fiandor called it] a sumo fight. Big data on one side, and on the other, all of us, the journalists, reporters, software developers, and editors,” Díaz-Struck said.

Eventually, machine learning helped ICIJ cull the data into more manageable groupings, and together with ICIJ technologies such as Datashare and other data analysis approaches, the team handled the big data. In parallel, more than 600 reporters from around the world took on the herculean effort of connecting the dots between reports of tax evasion and dubious financial dealings by hundreds of world leaders and billionaires.
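For readers curious about the mechanics behind the “spam filter” comparison, here is a minimal sketch of a Bayes-style text classifier built with scikit-learn. The training snippets and labels are invented for illustration; this is not ICIJ's actual pipeline.

# Minimal sketch of a Bayes-style "spam filter" text classifier using
# scikit-learn. Illustrative only -- not ICIJ's actual pipeline; the
# training snippets and labels are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "annual report and audited accounts of the offshore trust",
    "certificate of incorporation and registered agent details",
    "lunch menu for the office party",
    "printer is out of toner again",
]
train_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

# Bag-of-words features feed a multinomial naive Bayes model, which
# applies Bayes' theorem under a word-independence assumption.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

new_docs = ["registered agent for the offshore trust", "office party photos"]
for doc, label in zip(new_docs, model.predict(new_docs)):
    print(label, "->", doc)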

Pointing fingers: Naming past misclassifications

Another popular use of machine learning is to name misclassifications. This was the tack taken in 2015, when Ben Poston, Joel Rubin and Anthony Pesce used machine learning for The Los Angeles Times to determine that the Los Angeles Police Department had misclassified approximately 14,000 serious assaults as minor offences over an eight-year period. The misclassification made the city’s crime levels appear lower than they were.

As far back as 2015, The Los Angeles Times used machine learning to show how the city's police department underestimated the city's crime levels by misclassifying thousands of assaults as minor offences.

Similarly, BuzzFeed News’ investigation of secret surveillance aircraft to hunt drug cartels in Mexico, by reporters Peter Aldhous and Karla Zabludovsky, was a question of classification. The effort, which Aldhous documented in a separate BuzzFeed article and on GitHub, used a random forest algorithm, a well-known statistical model for classification, to identify potential surveillance aircraft.

BuzzFeed News trained a computer to find secret surveillance aircraft by letting a machine-learning algorithm sift for planes with flight patterns that resembled those operated by the FBI and the Department of Homeland Security.
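BuzzFeed News published its actual code and data on GitHub. Purely to give a flavour of the technique, here is a minimal random forest sketch on invented flight-like features; it is not the published model.

# Minimal sketch of random forest classification in the spirit of the
# spy-plane story. The feature values are invented; BuzzFeed News'
# real code and data are on GitHub.
from sklearn.ensemble import RandomForestClassifier

# Each row: [avg turn rate, share of time circling, avg flight hours]
X_train = [
    [20.0, 0.70, 4.5],  # known surveillance aircraft
    [18.5, 0.65, 5.0],
    [2.0, 0.05, 1.5],   # ordinary aircraft
    [3.5, 0.10, 2.0],
]
y_train = ["surveillance", "surveillance", "ordinary", "ordinary"]

# A random forest averages many decision trees, each fit on random
# subsets of rows and features, to produce a robust classifier.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

candidates = [[19.0, 0.60, 4.0], [2.5, 0.08, 1.0]]
print(forest.predict(candidates))        # predicted classes
print(forest.predict_proba(candidates))  # class probabilities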

And misclassification was vital in the ICIJ’s Implant Files. This expansive investigation found that medical devices implanted into people’s bodies -- such as vaginal mesh, copper coil birth control, breast implants, heart monitors, hip replacements, and so on -- were linked to more than 83,000 patient deaths and nearly 2 million injuries. Among the patients who died, 2,100 people, or 23% of these deaths, were not reported as deaths but more vaguely classified under “device malfunctions or injuries.”

Checking the wrong box has grave consequences, including misleading health authorities about when devices are linked to deaths and preventing regulators from knowing a product merits further review -- to the detriment of future patients. Díaz-Struck explained it took her team months to design and fact-check the machine learning for this research. In the methodology article, published in 2018, she explains that text mining, clustering, and classification algorithms were all involved.

They went on to use machine learning to make a second classification, to identify patient gender, a category not included in the patient files made available by the Food and Drug Administration in the United States. Many of those who had died or were harmed by implants were women, but not always from “women’s devices” such as breast implants.

Sometimes applications of machine learning come out of informal conversations.

What were the numbers? Partnering with researchers at Stanford University, Díaz-Struck’s team painstakingly trained a machine to identify the gender of patients who had been harmed by, or died from, an implanted medical device, using the presence of pronouns or mentions of gendered body parts in the notes section of the reports.

After six months of effort, the team of nearly a dozen was able to identify the sex of 23% of the patients, of which 67% were women and 33% were men. A key limitation was the quality of the data, said Díaz-Struck. Nevertheless, the effects of this reporting are not only greater transparency but reforms.

By any other name? Predicting misclassifications

Sometimes applications of machine learning come out of informal conversations. That’s what happened with the Grist and Texas Observer story predicting the number of unused oil and gas wells in the Permian Basin likely to be abandoned in coming years -- at a projected cost to taxpayers of a billion dollars. The story began with no talk of predictions; rather, it was an informal chat between Aldern and fellow Grist journalist Naveena Sadasivam.

“She’s been on the oil beat for ages and when the COVID-19 pandemic hit, the price of oil dropped. It even went briefly negative. When that happens, some of the mom-and-pop companies hit hard times. Would any go bankrupt, she wondered? And what would happen to the wells?” Aldern said. Sadasivam joined Texas Observer reporter Christopher Collins to find out.

Looking over data Sadasivam collected from public records requests, she and Aldern brainstormed “what we could say,” he recalled. They spent time organising it into a structured database, still unsure if there was a story. The dataset features included the production history of all the wells in Texas, plus macroeconomic indicators, employment, geotags, depth, drilling history for decades, and cost of oil over time.

A collaborative Grist and Texas Observer story is pictured above.

“At one point we asked, could we use this to figure out the future? This was a classification problem: which wells might be abandoned in the next couple years,” Aldern said. “And it’s a perfect question for machine learning.”

The results were damning. They predict 13,000 wells will be reclassified from inactive to abandoned in the next four years, costing taxpayers nearly one billion dollars -- not to mention the environmental effects of abandonment. Sadasivam and Collins’ on-the-ground reporting corroborated these findings, based on interviews with experts and ranchers who worried, “no one is coming.”

Aldern documented the methodology in an article and shared the data and code in a GitHub file. He also was featured in the Conversations with Data podcast earlier this year.

A screenshot from the story's Github file.

Holding technological black boxes accountable with machine learning

A subversive use of machine learning is holding privatised machine learning accountable. As the commercial rollout of AI has taken hold in the past decade -- which has implications for newsrooms as well, as we will document in the second piece in this series -- tech companies remain tightlipped about their processes, refusing to allow independent researchers to assess structured machine learning.

Meanwhile, algorithmic predictions have been criticised for reproducing inequalities, as Virginia Eubanks, a political science associate professor at the University at Albany, argues in her book "Automating Inequality"; and for incentivising -- and bankrolling -- disinformation campaigns, as Karen Hao reports.

For data journalists who are new to machine learning, it’s possible to follow along the work of others to learn.

The Markup, led by Julia Angwin, is a nonprofit newsroom focused exclusively on “watchdog” reporting about Big Tech. Like other newsrooms featured in this story, The Markup leverages machine learning and other data-driven methodologies to reverse engineer algorithms or identify misclassifications, publishing a “show your work” article and releasing the data and code alongside its investigations.

Maddy Varner, an investigative journalist at The Markup, said in an email that they use machine learning for investigations, including a random forest model in their work on Amazon’s treatment of its own brands -- a story that took a year to break and was also described in a letter from Angwin.

Above is a screenshot from The Markup homepage and its latest investigation examining Amazon's treatment of its own brands.

Transparency builds trust. “It is very important to not just say what you know but explain why you know it,” said Aldhous, who explained that transparency is a cornerstone value at BuzzFeed News. “The greater the ability to see under the hood of the methods, the better. It is like, this is how it works. This is why we have that number. This is why we think that’s a spy plane.”

No need to reinvent the robot

If getting started sounds daunting, one of the benefits of data science is the open-source community, said Aldern. Data journalists share code and training data on GitHub, where other data journalists or data scientists can take a look.

Don’t be afraid to copy-paste. Borrow tried-and-true algorithms for logistic regressions or decision trees. For data journalists who are new to machine learning, it’s possible to follow along with the work of others to learn.

But reporting won’t be fast. Lucia Walinchus, executive director of the non-profit newsroom Eye on Ohio and a Pulitzer Center grantee, has spent more than six months using machine learning to analyse public records on housing repossession in Ohio. The project seeks to understand, mathematically, what makes land banks repossess some homes that are behind on taxes, but not others.

It's an open secret in any data story that the majority of the work is getting the data into order

“It’s the perfect problem for software,” she said, though machine learning is only part of the story and doesn’t replace on-the-ground investigative research. Her inaugural machine-learning investigation is slated for publication in the coming weeks.

Resource-strapped newsrooms can consider partnerships with academics or companies. The ICIJ has partnered with Stanford University and independent companies to address particularly gnarly data problems while maintaining journalistic independence--crucial when dealing with sensitive materials for a big story that hasn’t yet been broken.

The ICIJ doesn’t outsource the preparation of training data, to ensure accuracy, though it did use a machine learning tool called Snorkel to help classify text and images. Outsourcing the human work of labelling to platforms such as Amazon’s Mechanical Turk, which relies on workers who are paid pennies, has raised ethical concerns.

Data journalists can also be mindful of criticism about the costs of partnerships with tech companies, as Whittaker writes.

When independent journalists or academics rely on tech companies for access to computational power or intellectual resources, those companies get the final say on decisions about what work to do, promote, or discuss. “In the era of big data, journalists are not going to disappear, they are more essential than ever,” said Díaz-Struck.

Resources to master machine learning

To ramp up skills, there are free training programmes available. At the Associated Press, Aimee Rinehart is leading a new effort to expand local news organisations’ understanding and use of AI tools and technologies, funded with $750,000 from the Knight Foundation’s AI effort. News leaders in U.S. newsrooms can take a survey to inform the curriculum of an online course designed by AP; the survey closes in early December 2021.

After running the course, AP will partner with five newsrooms to identify opportunities for AI and implement those strategies. This initiative follows on the heels of the London School of Economics Journalism AI project funded by Google News Initiative, which also offered a free course on AI for journalists.

Journalists can sign up to data journalism bootcamps run by Investigative Reporters and Editors.

Investigative Reporters and Editors run data journalism bootcamps to teach hands-on technical skills to turn data into accurate, interesting stories. These trainings are not free, but prices vary based on the size of the newsroom, with scholarships available, as well as discounts for students and freelancers. Programmes support journalists to sharpen basic to advanced skills in spreadsheets, data visualisation and mapping, SQL, and coding in R and Python. Journalists should be members of IRE to enrol.

Data journalists can bootstrap their own training program by learning from and participating in machine learning competitions based on over 50,000 datasets, run by Kaggle. While not specifically designed for journalists, the competitions can be valuable and come with three-figure prizes (in U.S. dollars). A Google account is required.

How it works: Machine learning in a nutshell

Let's run through the machine learning process. The basic tasks include the following; a short code sketch at the end of this section pulls steps 2 through 4 together:

1. Assemble data. "It's an open secret in any data story that the majority of the work is getting the data into order," said Aldern. Data can be public, garnered from public records requests, scraped, or shared from an external source. Consider the questions you'd like to use the data to answer.

2. Identify labels and features of some data to build a statistical model. Criteria for identifying features might be drawn from the investigation. For instance, for the query on whether inactive oil wells in Texas and New Mexico were misclassified and could soon be abandoned, Aldern used state agency definitions of "orphaned" and "inactive" to label data. This intel was gleaned by Naveena Sadasivam and Christopher Collins, reporters on the oil and gas beat.

3. Test the model to avoid overfitting or bias. Models should make generalised, accurate predictions. One trick for checking a model's performance is to divide the labelled dataset in half. Use the first half to train the model and the second to evaluate the trained model's accuracy. Tweak the model based on the results of the test run on the second labelled dataset.

4. Analyse the unlabelled data. This step leverages the trained model to answer the question you are asking of the remaining data: Which inactive oil wells could be misclassified as orphaned? Which files are spam? Which device reports have been misclassified as not causing harm? The methodology often relies on processes derived from statistical modelling, such as linear regression, decision trees, logistic regression, or clustering, and is written in programming languages such as R, Python or Julia.

5. Corroborate results. Aldern does this by "trying to prove myself wrong." To check the machine learning results, data journalists interviewed for this piece will ask data scientists affiliated with universities to independently review the results. Best practice also includes writing a methods article (as Aldern did here), along with sharing links to GitHub repositories. Finally, boots-on-the-ground reporting will substantiate results.

Above shows a flowchart depicting how data journalists can label data to create learning instructions for machines so they can make useful predictions about unlabeled data. Source: Monika Sengul-Jones.
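Pulling steps 2 through 4 together, a minimal scikit-learn sketch might look like the following. The well-like features and labels are invented for illustration; a real project would use far more data and the corroboration described in step 5.

# Minimal end-to-end sketch of steps 2-4: label, split, train, test,
# then predict on unlabelled rows. Features and labels are invented;
# a real project needs far more data and careful validation.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 2: labelled data. Each row: [is_producing, years_inactive,
# operator_in_bankruptcy]; label 1 = likely orphaned, 0 = not.
X = [
    [0, 12, 1], [0, 18, 1], [1, 2, 0], [1, 1, 0],
    [0, 15, 1], [1, 3, 0], [0, 20, 1], [1, 0, 0],
]
y = [1, 1, 0, 0, 1, 0, 1, 0]

# Step 3: hold out half the labelled data to check for overfitting,
# keeping both classes in each half via stratified sampling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)
model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 4: apply the trained model to unlabelled rows.
unlabelled = [[0, 17, 1], [1, 1, 0]]
print("predictions:", model.predict(unlabelled))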

Key words in machine learning

LABEL: The label is the thing that will be predicted later -- the dependent variable, the y variable in linear regression. A label could be a noun or an event. Spam. Spy Plane. Orphaned. Dead.

FEATURES: The features are the attributes of a labelled item that will be used to classify future, unlabelled items; otherwise known as independent variables. For instance, a feature might be pointy ears. A style of email address. Turn direction. Ownership status. Smoking habits.

CODE: The algorithm used to analyse the data. Often a version of a tried-and-true statistical model such as linear regression, a decision tree, logistic regression, or clustering, written in programming languages such as R, Python or Julia.

Monika Sengul-Jones, PhD, is a researcher, writer and expert on digital cultures and media industries. She lives in Seattle, Washington, USA. @monikajones, www.monikasjones.com

Thanks to Caroline Sinders, a machine learning consultant, for being interviewed for background research for this piece.

The power of data storytelling for lifestyle journalism https://datajournalism.com/read/longreads/the-power-of-data-storytelling-for-lifestyle-journalism Mon, 01 Nov 2021 08:00:00 +0100 Sabrina Faramarzi https://datajournalism.com/read/longreads/the-power-of-data-storytelling-for-lifestyle-journalism Data is like dust. It’s everywhere, and we often spend most of our time cleaning it up. Like dust, each one of us leaves a trace of our data for every move we make online.

Data journalism has long been confined to the areas of news, politics, business and finance. What if that dust (sorry, data) could be cleaned, gathered and shaped into a picture to tell more stories about how we live, too?

Lifestyle journalism has often been regarded as ‘fluff’ journalism, but it’s not hard to see its growth over the last two decades. From newspaper supplements to indie publishing and even social media, lifestyle journalism offers readers a guide for how to live and the stories behind the news.

Its complex dynamics have been a dance between traditional journalistic practices and consumerism, entertainment and cosmopolitanism. It has made sense of everything from food to fashion, technology to sexuality, and identity to houseplants.

For those who still browse the shelves of newsagents, you’ll find a myriad of niche magazines -- everything from gardening to photography, knitting to computer programming.

Online, these magazines are limitless, and so too are the ways that data-driven stories can reach different kinds of readers. Lifestyle journalism is all about curiosity, behaviours and the nuances of human life.

It is “a genre of journalism that constitutes a significant and growing portion of mainstream journalism yet continues to be an under-researched field,” writes Dr Lucia Vodanovic in her book ‘Lifestyle Journalism: Social Media, Consumption and Experience’.

At its core, the lifestyle journalist is a hybrid entity and can “adopt different functions such as marketer, service provider, friend, connector, mood manager, inspirer and guide”, explains Vodanovic.

Lifestyle journalism covers anything about the human experience, so it lends itself well to data storytelling, which can provide wider context behind that experience. It is also an area of journalism that has been historically difficult to define (though it has been analysed as “the journalistic coverage of consumption, identity and everyday life”).

This means it is possible to explore new formats, structures and techniques, which is an exciting proposition for data journalists looking to cover lifestyle stories and lifestyle journalists looking to use data in their stories.

What makes data storytelling in lifestyle journalism so captivating is that it is fertile ground for new forms of data analysis. Rarely are readers expecting complex data analysis in such pieces, yet lifestyle journalists often use data to inform those stories, whether consciously or not.

Mapping our lives through data

So where does data storytelling sit within this? Just like lifestyle journalism, data journalism rose from scrappy beginnings. While journalists have long used data in their stories, one of the first forays into computer-assisted reporting dates back to 1952, when CBS used the Remington Rand Univac Electronic Computer to predict the outcome of that year’s American presidential election.

This was also the same decade when lifestyle journalism got its start and when women’s style pages began appearing in newspapers. At the time, it was one of the few places for women journalists to flex their skills. While most journalists were ignoring computers, data journalism was born. While most newspapers were ignoring women, lifestyle journalism was born.

It is also no surprise that one of the most famous early data visualisations was by a woman (Florence Nightingale), or that lifestyle journalism was also born amongst women. One journalist famed for merging the two is Mona Chalabi.

With her signature data illustrations on Instagram, she covers all kinds of topics, including lifestyle issues too. One of her most lifestyle-led pieces for The Guardian examined the decline in smoking against the rise in vaping amongst high school students.

Chalabi’s visual style and the growing interest in hand-drawn graphs and charts nod to the messy nature of both lifestyle and data journalism.

The above Guardian datablog by Mona Chalabi shows the percentage of high schoolers who smoke cigarettes has more than halved since 2011, falling to just 8%.

For those more interested in the exploratory data experience, The Shape of Dreams may draw some inspiration.

By visually diving into Google searches for the interpretation of dreams, data visualisation designer Federica Fragapane provides cultural insight into what and how we dream.

Spanning seven languages, the analysis shows that Japanese people dream about emotions more often, while Russians dream about food more often.

Google search queries from Federica Fragapane's The Shape of Dreams cover questions like: ‘What does it mean to dream about…’, ‘Why do I dream about…’, ‘Meaning of dreaming…’

Data and the consumer

Because lifestyle journalism is so linked to consumerism, there is a lot of data available to explore those topics. From scouring online shopping sites to diving into market research reports and surveys, reams of data exist about consumption and the way people shop.

No publication has formalised a style of data-driven lifestyle journalism more than The Pudding, an online “publication that explains ideas debated in culture with visual essays.”

One of their pieces, ‘The Naked Truth’, looked at the names of over 6,000 ‘complexion’ products to explore bias in the beauty industry. Their analysis found that 80% of all shades with the word “natural” in them were on the lighter end of the scale.

A screenshot from The Pudding's The Naked Truth interactive data piece.

We can also tackle the other side of consumerism through data storytelling, as well as topics about data itself. One story I wrote for Vogue Business explored consumers' lack of trust in how brands use their shopping data, and the companies trying to tackle that.

Data can also capture physical behaviour, as in this story using Google’s COVID-19 Community Mobility Reports to show how people moved to and from retail destinations before the pandemic, during lockdowns, and after restrictions lifted.

Entertainment's data diversity report

For journalists with a culture beat, data storytelling can provide a wider picture beyond the pictures. This is especially interesting for audiences keen to see how Hollywood is performing on inclusion and diversity in front of the camera.

This piece by Sky News earlier this year looked at the diversity of Oscar winners -- across gender, age, race and ability -- as a way to trace, track and hold the industry to account on social issues.

The above chart, from a Sky News analysis, shows Oscar-winning actresses are on average nine years younger than their male counterparts.

Annual benchmarking of those on the big screen is another way to track progress. Interestingly, the BBC published an article revealing how 2020 was Hollywood's most diverse year ever.

Another key indicator is speaking parts. Text analysis, a favourite among data journalists, can reveal how many words women spoke in a film versus men.

That angle led The Pudding to compile the number of words spoken by male and female characters across roughly 2,000 screenplays. One of the findings showed that 22 of 30 Disney films had a male majority of the dialogue.

The above screenshot shows a chart from The Pudding's Film Dialogue piece showing 2,000 screenplays broken down by gender and age.

The rise of service journalism

For decades, news outlets and magazines have put readers' needs first by commissioning content that helps people solve everyday problems, known as service journalism.

You can find it in everything from consumer-oriented features and product reviews to well-researched how-to pieces. From guidance on personal finance to dinner ideas, this type of journalism is everywhere if you look closely enough.

In these pandemic times, the complexity of everyday life has led editors to place more value on service journalism coverage sitting alongside hard news and opinion pieces.

"Service journalism must no longer be marginalized as some lesser form of the enterprise: All journalism should be service journalism," writes Jeremy Olshan is editor-in-chief of MarketWatch in Nieman Lab's 2020 predictions for journalism. Ironically, that became the year where both data and service journalism hit the mainstream.

A screenshot from The New York Times article explaining how two of its reporters deliver "service journalism: answers to questions people are asking, and solutions to problems they are experiencing."

Data is critical when it comes to writing such articles. By digging into Google Trends data, journalists can understand what users are searching for and adapt coverage accordingly.
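
As a rough illustration, the unofficial pytrends library wraps Google Trends in Python; the search term and market below are arbitrary examples, not recommendations:

from pytrends.request import TrendReq

pytrends = TrendReq(hl='en-GB')
pytrends.build_payload(['air fryer recipes'], timeframe='today 12-m', geo='GB')
interest = pytrends.interest_over_time()  # weekly search interest, indexed 0 to 100
print(interest.tail())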

Finally, audience analytics from website traffic, social media engagement and readers' comments or questions inform editors what articles did well and what to commission next. Analytics data also provides insight into reader personas for the longer term.

Go-to data sources

A lot of everyday data journalism is about presenting data analysis as a source of truth for a particular story or subject. Where most journalists use data to think literally, lifestyle journalists can use data to think laterally.

A lack of data is one major challenge lifestyle journalists often face, and this is usually because no research exists on such everyday topics. So what are the possibilities for data sources when covering lifestyle?

There are many creative ways to get around these data deserts. Often, it’s about relying on existing data from brands and other private organisations, usually through surveys done for their own marketing purposes. But of course, those insights should be taken with a pinch of salt.

For example, suppose you wanted to write about the rise of artisan gin, but the data to show that came from an alcohol conglomerate. In that case, they may be trying to engineer a self-fulfilling prophecy. The same goes for writing a story about the rise of resale in fashion. If the data comes from a resale platform or a brand beginning to invest in resale, it may not be a legitimate data source.

Other routes can be through scraping online sources such as social media. For a travel piece I did for The Guardian back in 2018, I scraped travel-related Instagram hashtags and analysed them to plot emerging travel trends (rather than those that had already peaked).
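
The counting step of a project like that can be very simple once the posts are in hand. A minimal sketch, assuming the captions have already been collected (the sample strings are invented, and this is not the pipeline used for the Guardian piece):

import re
from collections import Counter

captions = [
    "Hidden beaches near Tirana #wanderlust #albania #traveltips",
    "Slow mornings in Matera #slowtravel #italy #wanderlust",
]

hashtags = Counter()
for text in captions:
    hashtags.update(tag.lower() for tag in re.findall(r"#(\w+)", text))

print(hashtags.most_common(10))  # the most frequent tags hint at emerging themes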

For other stories, I have used everything from Freedom of Information (FOI) requests to scouring online resources like Kaggle to cross-analyse and combine different datasets. National statistical offices, government websites and other open data platforms are a fantastic place to start, but how useful these sources are depends on your story.

Polls are also a helpful resource because this data is often freely available -- either found on publisher websites or through sites like YouGov. But it’s important to keep in mind the variables used and who the audience is. For shorter, faster news pieces they can be useful.

Gathering the data yourself

For topics where the data doesn’t exist at all, sometimes you have to come up with creative ways to measure it. Tools like Google Surveys allow you to generate your own data, but that comes at a cost and can be tricky if you aren't sure how to properly design a survey.

Sometimes, you might just need to step away from the computer and get out a ruler.

One of my favourite examples of a creative way around these data deserts for lifestyle topics is The Pudding’s story comparing pocket sizes by gender. They measured the pockets in both men’s and women’s jeans in 20 of the US’ most popular brands.

They found that, on average, the pockets in women’s jeans are 48% shorter and 6.5% narrower than men’s pockets. The authors did this by measuring 80 pairs of jeans themselves.

Above is a screenshot from The Pudding's piece on the size of women's vs men's jean pockets.

Data: Too much of a good thing?

Lifestyle journalists are well-positioned to use data in their stories. Just as lifestyle journalists start with a question (will life ever be normal again after coronavirus?), data analysis also nearly always begins with a question. Both are about abstraction and testing ideas to validate hunches. Both use intuition to guide them through the facts.

But too much focus on data in newsrooms can also be counterproductive. In a 2019 piece for The Business of Fashion, titled ‘Fashion Media Is Addicted to Data’, journalist and editor Amy Odell writes that “many fashion and lifestyle publishers are anxious for a quick fix...The instincts and talents of editors have been washed away by a flood of data in a desperate scramble for more clicks.”

Though she refers to the obsession with audience engagement analytics, that appetite for data has also spilt over into stories, and Odell warns that this strategy is problematic.

“Many of the biggest traffic spikes are driven by unique stories that do well precisely because they are unique….Constantly trying to repeat that which did well easily dissuades editors from pursuing or dreaming up new, creative ways to engage their audiences. And if editors and their writers aren’t doing that, it won’t be long before they’re replaced by machines.”

Finding your niche

How does one become a data journalist who writes about lifestyle? I began my career as a trends forecaster, which in itself is a hybrid role using qualitative, quantitative and intuitive methodologies.

The trend forecaster is a modern-day social anthropologist, but with more hard facts. In ‘Lifestyle Journalism: Social Media, Consumption and Experience’, my chapter -- Agents of Change -- talks about the harmony and tensions of being both a trends expert talking to a business audience and a lifestyle journalist talking to the consumer -- albeit on the same topics.

Both the trend forecaster and the lifestyle journalist are mediators, as is the data journalist. Our role(s) is to make sense of the world for the reader.

My background in running and analysing surveys, and my qualitative skills in interviewing and cultural ‘brailing’, set me up with the basic skills to become a data journalist.

Data storytelling is most interesting when bringing together disparate ideas, which is often why data storytellers are those with either several hobbies or portfolio careers. For me, my hobbies of drawing, painting and design set me up to understand visualisation.

But it was only during an internship I did back in 2017 at the T Brand Studio in London, the branded content arm of The New York Times, that things clicked for me. That year, The New York Times had published Journalism That Stands Apart, a report by its 2020 Group.

In it, they highlighted plans for the future of journalism, which included becoming more visual, more digitally native, growing new approaches to features and service journalism (read: lifestyle), and building more reader engagement. Data storytelling is part of all that.

Making it in data journalism

There are plenty of ways to get into data journalism or data storytelling. In a previous job, I was writing, analysing and distributing polls across numerous publisher websites, helping editors and audience development managers understand their readers and what topics they wanted to learn more about.

The polls ranged from Brexit to Brussels sprouts, but all required range -- something that both data journalists and lifestyle journalists have in common, interrogating datasets the way lifestyle journalists interrogate sources.

For newsrooms, data journalism can often act as the bridge between hard news and soft news.

“We have found that hard and soft news attributes in data journalism often appear close together in hybrid forms, which accentuates that processes of innovation within journalism institutions are often amalgamated with genre hybridization: the mixing of news styles and journalistic artifacts characterized by categorical "in-betweenness",” write academics Andreas Widholm and Ester Appelgren.

Data journalism is, by its very nature, collaborative. We often see multiple bylines on data journalism work, so learning how to work well with others matters.

Lifestyle journalists wanting to get into data journalism should begin by building partnerships with data analysts, visualisation designers and programmers to start mapping out stories. Not all newsrooms have big budgets, so building teams of freelancers is key.

Good data journalism takes time -- deny it that time and you lose accuracy, while going only for accuracy might mean you lose storytelling. One issue I’ve seen is that modern journalism workplaces don’t make time or space for this, missing the very core of what data journalism is and what it can do.

Data is often lost in translation. Through lifestyle journalism and telling stories about the human experience, we can begin to make sense of it.

I like to think about data journalism like the Dust Carpet by artist Igor Eškinja - intricate, abstract, fragile and always a little bit messy. Just like the lifestyles we’re reporting on.

Sabrina Faramarzi is a journalist, speaker and founder of data viz agency Dust in Translation. You can find her on Twitter, LinkedIn & Instagram.

Investigating troubling content on Amazon https://datajournalism.com/read/longreads/investigating-troubling-content-on-amazon Thu, 02 Sep 2021 07:30:00 +0200 Jonathan Gray Marc Tuters Liliana Bounegru Thais Lobo https://datajournalism.com/read/longreads/investigating-troubling-content-on-amazon Since the start of the COVID-19 pandemic in early 2020 there have been widespread concerns around what the World Health Organisation has described as an “infodemic” of misinformation and conspiracy claims.

Many journalists and media outlets have reported on how problematic claims have spread online, including on major social media platforms such as Facebook, Twitter, YouTube and Spotify, as well as emerging “alt tech” platforms.

We have recently been exploring how journalists, researchers and students can work together to use digital methods and data for investigating the infodemic.

Over the past year we’ve been working with institutions associated with the Public Data Lab – an interdisciplinary network exploring what difference the digital makes in attending to public problems – on projects and investigations into conspiracy content on Amazon. This resulted in stories by POLITICO Europe, BuzzFeed News and Fast Company, which are quoted throughout this piece.

These collaborations draw on approaches that are documented in the new edition of the Data Journalism Handbook (in a section on “investigating data, platforms and algorithms”) which two of us co-edited, as well as in the Public Data Lab’s Field Guide to “Fake News”.

In this piece, we’ll take a behind-the-scenes look at how digital methods and data can be used to investigate troubling content on Amazon. As many of the lines of inquiry below are based on data obtained through manual analysis and scraping of content and interfaces rather than through Application Programming Interfaces (APIs), this may be taken as a contribution to “post-API” investigations, as well as to how journalists, researchers and students can collaborate around digital investigations.

These projects were undertaken with a group of researchers and students at the Department of Digital Humanities, King’s College London; the Digital Methods Initiative, the Open Intelligence Lab and the Media Studies Department at the University of Amsterdam; DensityDesign Lab in Milan, and other institutions associated with the Public Data Lab, as part of an ongoing initiative on “engaged research-led teaching”.

The collaboration process involved researchers providing initial “project briefs” and “project pitches” in consultation with journalists and others interested in investigating the infodemic, each proposing steps and approaches for following different lines of inquiry such as de-platforming, monetisation, conspiracies and spirituality, creative hashtagging and conspiracy aesthetics.

These briefs served as the basis for a first round of student projects at King’s College London in autumn 2020. The Amazon books project was further developed at the Digital Methods Winter School 2021 in Amsterdam.

We then provided journalists with packages of materials from these projects, including slide decks, visualisations and observations, along with a folder of documented datasets and research notes. Our journalistic collaborators then built on and incorporated these materials into their reporting, sometimes with further exchanges and follow-on research. We have indicated below how the different recipes surfaced in reporting with links and quotes.

What counts as a COVID-19 conspiracy book?

We began the project by looking for books that promoted COVID-19 conspiracy claims. But what counts as a COVID-19 conspiracy book?

The team was lucky enough to be working with conspiracy researcher Peter Knight from the University of Manchester, who suggested five key features of conspiracy theories based on research in this area:

Five key features of conspiracies from Peter Knight:

  1. Nothing happens by accident (deliberate secret planning)
  2. Nothing is as it seems (appearances are deceptive, official version is a lie)
  3. Everything is connected
  4. Tone/style of conspiracy theories (e.g. apocalyptic, Manichean)
  5. Assumption of going against received wisdom

While conspiracy books are a thriving literary genre, how might one identify books which are mainly about promoting COVID-19 conspiracy claims? We looked for the presence of pandemic-related keywords and themes, the presence of actors or entities related to COVID-19 conspiracy theories, and how books were being read as connected to the pandemic.

Based on this we considered three ways in which books can be considered to be about COVID-19 conspiracies:

1. According to writers – books which are explicitly written about COVID-19 conspiracies;

2. According to readers – books which are (sometimes retrospectively) read as being related to COVID-19 conspiracy claims;

3. According to algorithms – books that are algorithmically associated with COVID-19 conspiracies through an interplay between recommendation features and user practices.

As we wanted to focus on books that were explicitly conspiratorial, we decided not to list books that were merely sceptical of the pandemic, vaccines or lockdowns, as such scepticism did not in itself qualify as conspiratorial per the features mentioned above. However, such books made plenty of appearances during the course of our investigation.

What kinds of COVID-19 conspiracy books can be found on Amazon?

Based on this narrower, writerly understanding of a COVID-19 conspiracy book, we queried for keywords, followed "related books" and made an initial list of 18 conspiracy books on Amazon.com.

Six of these have subsequently been taken down or are no longer available. Over the course of the project, we were able to identify others by following this initial set of books.

What are these books about? We experimented with several techniques for highlighting key themes based on the analysis of the full texts of the books, such as through word trees (which can be generated using this free tool):

While these graphics give an indication of prominent themes and concerns, there is no substitute for reading the books. So a group of researchers and students read the full texts of the books, created in-depth profiles of each, and examined all of their comments and reviews on Amazon sites.
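
For a first quantitative pass before the close reading, a simple frequency count over each book's full text can point to themes worth checking. A minimal sketch in Python, assuming the text has already been extracted to a file (the filename is hypothetical):

import re
from collections import Counter

STOPWORDS = {"the", "and", "that", "for", "with", "this", "are", "was", "you", "have"}

def top_terms(text, n=25):
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if len(w) > 3 and w not in STOPWORDS).most_common(n)

# book_fulltext.txt is a hypothetical file holding one book's extracted text.
with open("book_fulltext.txt") as f:
    print(top_terms(f.read()))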

We also explored which kinds of themes were most prominent across the books, drawing on these readings and profiles, as well as thinking along with a list of COVID-19 conspiracy themes which had been identified in the course of the Infodemic research project. This was used to create a network showing which of the books shared which themes.

These lists, and the approach of identifying and discovering COVID-19 conspiracy books by following recommendations, were taken up by POLITICO Europe:

To determine how widespread disinformation was on Amazon, POLITICO Europe worked with researchers from King's College London and the University of Amsterdam who started with 16 widely available QAnon and COVID-19 conspiracy books on the e-commerce giant. The academics then relied on the company's own recommendations — based on automated algorithms that serve up other titles that may be interesting to its customers — to compile a list of more than 100 books with ties to disinformation and conspiracy theories.

POLITICO Europe also conducted a separate review of Amazon's U.S., British, German and French sites by searching for books associated with QAnon and COVID-19, and similarly used the company's own recommendations to put together a list of 70 different books.

While the English-language online marketplaces had the most conspiracy theory content, Amazon's German and French versions also listed reams of such material, often associated with local groups like Germany's right-wing identitarian movement and Didier Raoult, the French doctor who promoted an antimalarial drug to treat COVID-19.

Recipe 🥣

  1. Query for COVID related keywords and make an initial list of problematic/conspiracy books.
  2. Look at recommendations associated with these books on Amazon sites to find more books.
  3. Qualitative analysis to classify them.
  4. Identify key themes and visualise using Gephi, an open source network analysis tool.

How and where does COVID-19 conspiracy content appear on Amazon?

How do COVID-19 conspiracy books appear on Amazon websites? In order to explore this, we created multiple fresh browser profiles on multiple devices and used multiple web proxy servers so that we could also understand the extent to which results were being personalised based on location, IP, cookies and previous browsing activity. We queried for several topical keywords such as “covid”, “covid-19”, “coronavirus”, “vaccine” and “lockdown” and recorded what we found in the first page of results (normally returning a maximum of 16 items).

As well as identifying a number of conspiracy books listed in the first page of these search results (📕), we also found that the most common way to encounter these books was through Amazon’s recommendations from the other listings (👉🏼). There were also many conspiracy-themed comments (💬), including some indicating that a book not written about COVID-19 conspiracies was nevertheless being read as a source book by conspiracy book readers.

Drawing on this work, BuzzFeed News reporters Craig Silverman and Jane Lytvynenko commented in their piece:

[...] COVID conspiracy books have appeared on the first page of search results for basic terms like “covid,” “covid-19,” and “vaccine.” Amazon also recommended conspiracy books when the researchers browsed non-conspiratorial books about the virus and related topics.

Recipe 🥣

  1. Query for COVID-19 related terms.
  2. Obtain a list of books per website.
  3. Interface analysis to see whether there is conspiracy content, taking notes and screenshots on book content, recommendations, comments and other material.

Which platforms are mentioned in the book reviews?

We were curious whether the comments and reviews on COVID-19 conspiracy books might give a sense of where readers were coming from or going to. This might also suggest which kinds of platforms are hosts to more lively COVID-19 conspiracy communities.

We queried the comments of our list of COVID-19 conspiracy books for the names of various social media platforms, websites and alt-tech platforms, and found that YouTube was most frequently mentioned.

Recipe 🥣

  1. Start with a seed list of conspiracy books.
  2. Obtain associated metadata.
  3. Explore and query for platform names and other online spaces / alternative platforms.
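
A minimal sketch of step 3 in Python, assuming the review texts have already been gathered into a list (the sample reviews are invented):

from collections import Counter

PLATFORMS = ["youtube", "facebook", "twitter", "telegram", "bitchute", "parler"]

reviews = [
    "Watch the author's YouTube channel before it gets taken down!",
    "Found this via a Telegram group, essential reading.",
]

mentions = Counter()
for review in reviews:
    text = review.lower()
    mentions.update(p for p in PLATFORMS if p in text)

print(mentions.most_common())  # which platforms reviewers point to most often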

To what extent does the availability of COVID-19 conspiracy books differ among national Amazon marketplaces and subsidiaries?

We also looked into the availability of our initial list of books across Amazon marketplaces and subsidiary companies, and found significant differences across these websites and services.

While in some countries none of the books was available (e.g. UAE, China), in others more than two-thirds were still available (e.g. Canada, France, Germany, Italy, Japan, Spain, United States). All of the books were available on Goodreads, and none were available on Audible.

When books were taken down, we found that Amazon sites and services displayed different pages and messages. Sometimes this would include recommendations for other COVID-19 conspiracy books or query terms.

As commented in the BuzzFeed News piece:

[...] this feature is not consistent across Amazon’s international stores. Of its English-language stores, Amazon Canada and Singapore did not display government resources when searching for “covid” or “vaccine.” The company’s store in the United Arab Emirates showed them when searching for “covid” but not for “vaccine.”

Fast Company incorporated the graph into their piece together with the comment that “the result is a moderation process that often appears haphazard and unclear”.

Recipe 🥣

  1. Start with a list of conspiracy books.
  2. Query for each of them on Amazon national sites.
  3. Analyse differences in moderation practices and what pages are displayed to visitors per book per site.
  4. Check its availability on Goodreads and Audible.
  5. If the books are unavailable, check the recommendations.

How do Amazon recommendations bring users to and from COVID-19 conspiracy books?

Given the role of recommendation algorithms in displaying COVID-19 conspiracy books in relation to innocuous query terms (e.g. “covid”, “coronavirus”), we decided to explore this in more depth.

Starting with the query "covid" on Amazon.com, we made a list of all of the books that appeared in the search results, and then expanded it to include all of the books that appeared in the recommendations associated with these books, and then again to include all of the recommended books associated with those books.

Based on these lists, we made a network showing how books were associated through their recommendations. Conspiracy books and other books sceptical of COVID-19 appeared at the centre of the resulting network of recommendations.

This further suggests that Amazon's recommendation features and algorithms play a significant role in facilitating access to conspiracy books, taking the first page of search results for “covid” on Amazon.com as a starting point.

Drawing on this work, POLITICO Europe commented:

The company's recommendation engine, an automated tool that offers up other titles people may be interested in, based on others' purchasing histories, similarly pushes people towards conspiracy theories and disinformation.

In their piece, BuzzFeed noted:

The problem highlights how Amazon’s search and book promotion mechanisms often direct customers to COVID-19 conspiracy titles.

Recipe 🥣

  1. Take a list of books appearing in search for “covid” on Amazon.com.
  2. Obtain books suggested through “customers also viewed” recommendations.
  3. Obtain recommendations associated with a list of books from the last step.
  4. Combine and organise data for visual network exploration (Table2Net may be helpful here).
  5. Visualise with Gephi, an open source network analysis tool.
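
A minimal sketch of steps 4 and 5 in Python, assuming the recommendation pairs have already been collected (the titles are placeholders); the resulting GEXF file can be opened directly in Gephi:

import networkx as nx

# (book, recommended_book) pairs collected from "customers also viewed" panels.
edges = [
    ("Book A", "Book B"),
    ("Book A", "Book C"),
    ("Book B", "Book C"),
]

G = nx.DiGraph()
G.add_edges_from(edges)
nx.write_gexf(G, "recommendations.gexf")  # open in Gephi for layout and exploration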

How do Amazon-owned websites facilitate engagement with COVID conspiracy content?

We also noticed that there were other features of Amazon websites that were involved in the promotion of conspiracy books.

One prominent feature was the “best-seller” ranking. We looked at which of the COVID-19 conspiracy books featured in best-seller rankings and discovered that quite a few were included in the top 50 rankings across a number of categories.

COVID conspiracy books in category rankings

As the BuzzFeed News piece commented:

Despite being filled with misinformation about the pandemic, Icke’s book "The Answer" at one point ranked 30th on Amazon.com’s bestseller list for Communication & Media Studies. Its popularity is partly thanks to the e-commerce giant’s powerful recommendation algorithms that suggest "The Answer" and other COVID conspiracy theory books to people searching for basic information about the coronavirus, according to new research shared exclusively with BuzzFeed News.

Fast Company’s piece commented:

One top-ranked book that promises “the other side of the story” of vaccine science is #1 on Amazon’s list for “Health Policy.” Next to it, smiling infants grace the cover of the top-selling book in “Teen Health,” co-authored by an Oregon paediatrician whose license was suspended last year over an approach to vaccinations that placed “many of his patients at serious risk of harm.” Another book, Anyone Who Tells You That Vaccines Are Safe and Effective Is Lying, by a prominent English conspiracy theorist, promises “the facts about vaccination — so that you can make up your own mind.” There are no warning notices or fact checks—studies have shown no link between vaccines and autism, for instance—but there are over 1,700 five-star ratings and a badge: the book is #1 on Amazon’s list for “Children’s Vaccination & Immunization.”

Tips for doing digital investigations together

Beyond the specific outcomes and outputs from these digital investigations, we are interested in how researchers, students and journalists can work together on digital investigations, which is also discussed more extensively in the Data Journalism Handbook. We have also been exploring this as part of work on “engaged research-led teaching”.

We conclude with a few things that we learned through the process of organising collaborative digital investigations this year:

  • Following many years of organising and contributing to Digital Methods Winter and Summer Schools in Amsterdam, we often use “data sprints” or workshops in order to set aside time for focused group work, accompanied by moments for collective reflection and feedback (e.g. sharing slides with collaborators to see what would be of most interest to them).

  • Rather than assuming that questions and problems are fixed from the outset, expect them to change over the course of your project.

  • There is much to be gained from taking a qualitative and interpretive approach towards your material rather than mainly focusing on statistical or computational techniques. This is especially true when it comes to thinking about how things are sorted out on the web and social media (e.g. what counts as a conspiracy theory).

  • You may wish to consider keeping research diaries and documenting datasets so it is possible for others to retrace your steps later on.

  • As a complement to API-based tools and applications, manual analysis of user interfaces and gathering your own data can provide many insights for digital investigations.

  • Having worksheets and "recipes" for doing digital investigations on various platforms can help to get things started and can be adapted as needed and shared with collaborators.

  • Visualisations can be part of the process of data exploration rather than just to summarise or present findings. Simple visualisations can be prototyped in spreadsheets (e.g. rankings, emoji graphs) or with free open source tools such as Raw Graphs, Datawrapper or Gephi.

This approach aims to support making space in universities and classrooms for experimental, creative and critically engaged digital investigations, without taking for granted the questions, data, methods, materials and means through which they are produced.

Further reading and resources:

Dr. Jonathan Gray is Lecturer in Critical Infrastructure Studies at the Department of Digital Humanities, King’s College London, where he is currently writing a book on data worlds. He is also co-founder of the Public Data Lab; and Research Associate at the Digital Methods Initiative (University of Amsterdam) and the médialab (Sciences Po, Paris). More about his work can be found at jonathangray.org and he tweets at @jwyg.

Liliana Bounegru is Lecturer in Digital Methods at the Department of Digital Humanities, King's College London. She is also co-founder of the Public Data Lab and affiliated with the Digital Methods Initiative in Amsterdam and the médialab, Sciences Po in Paris. More about her work can be found at lilianabounegru.org. You can follow her on Twitter at @bb_liliana.

Marc Tuters is an Assistant Professor in the University of Amsterdam's Media Studies faculty, and a researcher affiliated with the Digital Methods Initiative (DMI) as well as the Open Intelligence Lab (OILab). His current work draws on a mixture of close and distant reading methods to examine how online subcultures use infrastructures and vernaculars to constitute themselves as political movements.

Thais Lobo is a journalist and new media researcher working with collaborative investigations around online platforms and digital cultures. She is also a researcher and teaching assistant at the Department of Digital Humanities, King's College London, and has collaborated with the Public Data Lab in various data projects. She tweets in Portuguese and (occasionally) English at @thais_lobo.

This piece builds on “engaged research-led teaching” activities with the Public Data Lab network, in collaboration with the AHRC-funded Infodemic project. This includes projects with researchers, journalists and students at the Department of Digital Humanities, King’s College London as well as at the Digital Methods Winter School 2021 at the University of Amsterdam. Thanks to all of the students, researchers, journalists and workshop participants who contributed to the development of these investigations.

Turbulent with a chance of data: Journalism’s drone-powered futures https://datajournalism.com/read/longreads/turbulent-with-a-chance-of-data-journalisms-drone-powered-futures Thu, 05 Aug 2021 07:30:00 +0200 Monika Sengul-Jones https://datajournalism.com/read/longreads/turbulent-with-a-chance-of-data-journalisms-drone-powered-futures In 2011, The New York Times announced the arrival of drone journalism. Newsrooms were beginning to use drones to help journalists safely report on events that were difficult to attend—protests and environmental disasters. The mood was bright: drones were hip.

As communication scholars Lisa Parks and Caren Kaplan write, the news media had a nearly insatiable appetite for drones—celebrating the novelty of unmanned aircraft flying inside volcanoes alongside panic about drones invading domestic spaces.

Could news media use drones to better inform the public? To procure new data or do remote fact-checking with small unmanned aircraft? Could drones protect journalists, who have been targets for violence? Enthusiasm waxed. And—a decade later—waned.

According to Matt Waite, the “de facto dean” of drone journalism, who has led the Drone Journalism Lab at the University of Nebraska-Lincoln since 2011, drone journalism is currently stalled.

“When we talked about drones years ago, a lot of the promise was that we have an ideal platform for photojournalism,” said Waite. “That’s happened. One of the most enduring images of the pandemic is going to be long car lines for food distribution, for testing facilities and vaccination sites. Those were shot with drones. Whether or not we will go to the next phase [with drones] remains to be seen.”

Criticisms of drones, with their annoying buzz and invasive tendencies, resemble criticisms of the news media’s own weaknesses: the tendency to sensationalise and intrude, exemplified recently by Japanese tennis player Naomi Osaka’s refusal to speak with the press at the French Open. “The public often doesn’t trust drones,” said Waite.

Drones are often used to take aerial photographs in places that humans cannot easily access, such as this image of Australian beachgoers in 2020. But Matt Waite, lead of the Drone Journalism Lab at the University of Nebraska-Lincoln, said more can be done. Credit: Photo by Manny Moreno on Unsplash

Drone-worthy panoramic shots

In 2021, the most common use of drones in journalism is photojournalism. Drones serve as a remote-controlled lens stowed in the photojournalist's camera bag.

“The view from above brings cinematic perspective to a simple event,” said Tomer Appelbaum, a photographer for the Israeli daily Haaretz. Appelbaum won a Siena award for an aerial view of socially distanced anti-Netanyahu protests in Tel Aviv in 2020.

But Appelbaum’s drone didn’t prevent Israeli police from confronting and briefly detaining the photojournalist while he covered a demonstration against Israel's West Bank annexation.

It’s not surprising drones are used for photojournalism; media consumption tends towards the visual, said Marisa Brandt, a STEM teaching professor at Michigan State University who studies technologies and society.

“It’s great to get click-worthy and noteworthy images [with drones],” she said. “But this does put us in a situation where a technology which allows us to create images that seem self-evident will get much more play and visibility than things that require a level of interpretation.”

There are other stories—data-led stories—that drones can help journalists to tell. Waite said using drones to do heat mapping, 3-D terrain mapping, sense pollution, or create 3-D models of buildings are areas of opportunity for data journalists.

Some of this work has happened. For instance, Radiolab and WNYC’s Data News Team launched a sensor project to detect ground vibrations to predict when cicadas will emerge from the earth.

The New York Times, meanwhile, has reported on scholars and journalists using drones and aeroplanes equipped with lasers to discover ancient sites across thousands of square miles with lidar maps.

Overall, said Waite, fewer efforts than he hoped have been undertaken. The learning curve for using sensor-equipped drones is steep, requiring speciality equipment and knowledge. And expensive, sometimes prohibitively expensive. “Such an investment just might be out of reach for many newsrooms and freelance journalists,” Waite said.

Not all is lost. There are ways that journalists and newsrooms can use drones, or drone-like technology, to critically propel data journalism forward.

A "no drones" sign on private property. Photo credit: Martin Sanchez/Unsplash

Dark origins: A brief history of drones

For all of their cool, gamer aesthetic—and yes, aviation hobbyists have used remote control aeroplanes for a quarter of a century—journalists should first remember that drones were initially developed and used for surveillance, military reconnaissance, and targeted killing.

The United States has used drones that tracked, injured, and killed thousands of people in countries including Pakistan, Afghanistan, Somalia, and Yemen between 2010 and 2020, according to data collected by the Bureau of Investigative Journalism.

Meanwhile, National Public Radio reported in early June 2021 that the United Nations is investigating the first alleged case of a drone autonomously finding and killing a human on the ground in Libya with no operator oversight. For millions of people, “daily life is haunted by the specter of aerial monitoring and bombardment,” writes Parks, an expert in surveillance infrastructures and distinguished professor at the University of California, Santa Barbara.

Goldman Sachs notes commercial drones make up less of the market but are responsible for a majority of the revenue.

It’s only been in the last decade that drones have gained market share in commercial and private sectors, a market Goldman Sachs estimates is worth approximately $100 billion. Globally, the largest consumers of unmanned aerial vehicles (UAVs), also known as unmanned aerial systems (UAS), are militaries and law enforcement agencies, followed by agriculture and construction.

If the spectre of monitoring seems more banal than science fiction, that’s because constant surveillance is a normal part of everyday life in the early 21st century. From internet-connected baby monitors to mobile phone tracking towers, drones are among the warfare technologies that have been brought to the domestic marketplace, “intensify[ing] militarization in everyday life,” Parks writes.

Turbo-charge your stories with drone data

Journalists can face this reality head-on. Begin by seeking access to domestic drone footage collected by other sectors, including law enforcement, construction, or insurance companies, and reporting on it.

WFPL News, a National Public Radio affiliate and independent, nonpartisan news outlet in Louisville, Kentucky, USA, obtained copies of more than 11 hours of footage from drones flown by the Kentucky State Police.

Following the deaths of Ahmaud Arbery, George Floyd and Breonna Taylor in the spring of 2020, officers flew drones to track demonstrations. WFPL’s investigation of the drone footage shows police using force without prior instigation from demonstrators. The exception was a protester tossing a water bottle at the drone above, after which the police fired rounds of pepper spray. “That was fun,” said an officer, in one of the rare videos that included audio.

Newsrooms in the United States and Europe are often restricted from flying drones over groups of people. In late 2020, the Federal Aviation Administration (FAA) in the United States issued new guidelines allowing commercial drone operators to fly at night.

However, journalists can obtain drone footage taken by federal agencies such as law enforcement using information requests granted through the Freedom of Information Acts in the United States, the European Union and the United Kingdom.

In addition, journalists can request satellite images rather than using drones. In 2015, investigative reporters for the Associated Press used DigitalGlobe (now Maxar) satellites to glean evidence that cargo ships were trafficking enslaved humans from Myanmar and forcing them to fish. The data was used for the Pulitzer Prize-winning story, “Are slaves catching the fish you buy?”. After publication, 2,000 slaves were freed and the investigation inspired reform efforts in the United States.

Satellites can also provide data for disaster reporting: Quartz’s David Yanofsky did work on the California drought using satellite imagery and data to create nearly one hundred maps and charts. While not as granular as drones, satellites can provide high-level insights for less investment.

For mapping, drone operators interviewed in this article use fixed-wing drones by eBee, as pictured above.

Drones are expensive and learning takes time

While some hobby models cost only a few hundred euros, a commercial drone with a longer battery life can cost more than €10,000. In addition to equipment costs, learning and licensing can be time-consuming.

“Using drones is a specialised skill set,” said Johnny Miller, a freelance drone operator who has published arresting images of global inequality using drones for his project, Unequal Scenes; these photographs have been featured in publications such as TIME Magazine.

Miller co-founded africanDRONE in Cape Town, South Africa and is a fellow at Code for Africa.

“You need a variety of complementary skills that work together: being able to produce. Keep yourself safe, situational awareness. Drones have complexity with real consequences,” said Miller.

Situations can go sour quickly. Generally, a pilot should keep the drone in their line of sight and be able to land safely. The battery life of entry-level commercial drones might be twenty minutes or less. Appelbaum recalled a “nightmare” moment when another journalist’s drone lost power mid-flight. It crashed—luckily, no one was injured.

“The pilot in command is always responsible for the drone’s actions,” said Miller. “If it falls out of the sky and kills someone, you are going to jail.”

Professional training and labs are available. In 2017, Poynter, with the National Press Photographers Association, The Drone Journalism Lab at the University of Nebraska and DJI (Da-Jiang Innovation, a popular Chinese drone manufacturer), guided nearly 400 journalists in the United States to legally and safely operate drones.

Drone operators interviewed in this article use DJI Pro series for photojournalism, as pictured above.

But regulations, time, and costs can prevent smaller newsrooms and independent freelancers from taking advantage of drones’ affordances for journalism. Not to mention, many training programmes have been on hold.

Waite said the COVID-19 pandemic set back the drone journalism programmes at the University of Nebraska. “It’s difficult to have ethical conversations about drones in asynchronous learning,” he said.

Alternatively, newsrooms can partner with experienced drone operators and open source data projects. africanDRONE is a pan-African organisation “committed to using drones for good,” including drone journalism, and welcomes community partners to implement projects.

For example, Frederick Mbuyu, co-founder of africanDRONE, led the Zanzibar Mapping Initiative (ZMI), funded by the World Bank and Tanzanian government, to remap Tanzanian lowlands and slums that were inaccurately or incompletely portrayed in satellite imagery.

The project, called Ramani Huria (“our map” in Swahili) used drones and OpenStreetMap, an open-access mapping software, to make accessible maps that identified the most flood-prone areas. Over time, the maps have taken on greater significance and include numerous high-level details of the communities.

For projects like this, drones are not a one-and-done event. Most drone projects are 80% planning and analysis, 20% drone flying, said Miller, who collaborates with Mbuyu. These projects take months.

“It takes a team to do this,” said Miller. “You have to train people to learn to fly these drones, learn to fly in the patterns the maps require, then—crucially—interface with the community. Identify people in the community who can map on the ground. You need people on the ground. Drones cannot just parachute in and tell the story. The story comes first.”

Partnerships make sense: the eBee drones used to make the maps cost between €10,000 and €20,000.

A map of Dar es Salaam before (left) and after (right) Ramani Huria, a drone mapping project done in collaboration with africanDRONE, the communities, and journalists funded by the World Bank and Tanzanian government.

Develop long-term projects with community partners

Newsrooms and freelance journalists can apply for funding to develop community partnership projects that use drones for storytelling.

This was the tack taken by the Sensemaker project in the United Kingdom. Funded by the Google News Initiative, the project was a collaboration between the Civic Drone Centre at the University of Central Lancashire, the Manchester Evening News, and Cringle Brook Primary School, and it sought to use sensemaking machines, including drones, for journalism.

“Stories about pollution, stories from pictures, stories about things that we haven’t yet imagined,” said John Mills, project lead and associate professor at the University of Central Lancashire. “These are contributing to journalism.”

Paul Gallagher, publisher of the Manchester Evening News, said the initiative was unique because the storytelling didn’t start with trying to find a story in an existing data set. Instead, journalists and community partners asked questions and gathered data to help to tell a story.

The administration at the Cringle Brook Primary School in Manchester was concerned about air pollution, said Louise Taylor, assistant headteacher. She joined the Sensemaker team focused on detecting nitrogen dioxide while the Manchester Evening News reported on the efforts. After monitoring, Gallagher said they found pollution spikes in the morning and the mid-afternoon, coinciding with school drop-off and pick-up times. The school made a public presentation to the parents about the effort, and this led to a behaviour change. Fewer cars, less pollution.

“It’s not something that we’ve plucked from Google,” said Helen Chase, head of the school. “It’s a live statistic that happens here, now. They can’t argue about it. That gives us strength.”

Pay attention to regulations

Drone certification requirements vary nationally and change over time.

The European Union Aviation Safety Agency (EASA) takes a risk-based approach to drone regulation. Drone use is not differentiated based on leisure or commercial purposes. Rather, drones are regulated based on weight and activity. This handy explainer for EU drone operators has information on regulatory categories and licensing.

In the United Kingdom, where European Union law no longer applies, all drone pilots are required to take a test and register drones, regardless of purpose or size.

In the United States, drone certification requirements differ for commercial and leisure drone activity. Under the Federal Aviation Administration's Small UAS Rule (Part 107) for commercial use, drone operators must obtain a Remote Pilot Certificate to demonstrate understanding of safety procedures, operations, and regulations. Obtaining the certificate requires passing an exam, and it must be renewed every two years.

Reporting on regulations and their asymmetrical implications can also serve the public.

“A drone is a frame,” Miller said. “Being able to see everyone from above has historically been reserved for the rich and governments. Then, all of a sudden, the common person can [use drones] to analyse the land around them. [It can be] a democratising technology, but it’s not a silver bullet. It’s a mindset shift.”

Quick links

Earn your wings: Becoming a drone pilot

Drones used by drone operators in this article

Monika Sengul-Jones, PhD, is a freelance researcher, writer and expert on digital cultures and media industries. She was the OCLC Wikipedian-in-Residence in 2018-19. Since 2020, she has been co-leading Reading Together: Reliable Sources and Multilingual Communities, an Art+Feminism project on reliable sources and marginalised communities funded by WikiCred. @monikajones, www.monikasjones.com

Excel dynamic array functions: what data journalists need to know https://datajournalism.com/read/longreads/excel-dynamic-array-functions-what-data-journalists-need-to-know Thu, 08 Jul 2021 08:00:00 +0200 Abbott Katz https://datajournalism.com/read/longreads/excel-dynamic-array-functions-what-data-journalists-need-to-know Let’s face it: you’d rather write a feature than a formula, but when duty calls -- when you need to go one-on-one with a dataset that just might have something important to say to you -- it’s time to break out that spreadsheet, stoke your latte, and hope that you’ve caught the numbers in a seriously good mood. But if not, help is available: Microsoft 365's Excel dynamic array functions and their radically new and powerful take on formulas will help put you and the data on speaking terms.

While you don't need to be an Excel master to understand these new features, you should have a strong grasp of the tool and be comfortable with basic functions and developing formulas to learn from this article and comprehend the mini-tutorials that follow.

True, they took their sweet time in arriving -- Excel heralded the new functions in September 2018, but somehow managed to wait until January 2020 before finding someone home in my hard drive. But now that they’re here, you’ll find that the new functions streamline and ease an array of tasks that would have had you reaching for that pack of cigarettes -- and you don’t smoke.

Defining a dynamic array formula

But first things first: what’s an array formula, what distinguishes it from a conventional spreadsheet formula, and exactly what’s dynamic about this newest iteration?

Good questions. The answer begins to unfold with the recognition that array formulas aren’t quite a next big thing; in fact, they’ve been an entry in Excel’s (and Google Sheets’) catalogue of bells and whistles for quite some time, even if few spreadsheet users have ever turned to that page. But antiquity and obscurity notwithstanding, what you need to know is that an array formula performs multiple calculations that in turn yield multiple results.

Beginning with the basics

Here's a mini walkthrough tutorial to get you started. By way of simple illustration, consider this scenario: we’ve been brought face-to-face with a collection of sales figures, with the Number and Price data populating the range H2:I6 (the headers occupy cells H1:I1):

The activity above triggers the obvious question -- namely, how much money has this grocer managed to ring up? The standard means toward the answer would have us enter this formula in cell J2:

=H2*I2

After which we’d copy that elementary expression down the J column for another four cells, mustering five rows’ worth of sales in toto. We’d then finalise the business by inscribing a SUM formula somewhere, realising a bottom line of £48.52 as shown in the video below.
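
That finalising formula, by the way -- assuming the five products landed in J2:J6, as they will if you copy down from J2 -- would read:

=SUM(J2:J6)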

The total formula count: six.

Now whilst you probably didn’t need me to march you through those rudimentary paces, keep in mind that if our worksheet had been stuffed with 50,000 rows’ worth of transactions instead -- a not inconceivable prospect -- we’d require 50,001 formulas before we’d alight atop the bottom line: 50,000 instances of multiplication, one for each and every Number times Price, and one SUM to hoover them all into the grand total.

Now consider Plan B: the array formula alternative to our original, five-row multiply-and-sum challenge as demonstrated in the following video:

{=SUM(H2:H6*I2:I6)}

Trimming your formula to one single cell

You can probably see where the formula has taken the math. The two contributory ranges -- the ones bearing the numbers of items and their prices -- are lined up, each pair of values in their respective rows is multiplied internally somewhere in the formula’s cerebrum, and they’re all finally enwrapped by the SUM function that at last ushers the result to the worksheet.

In other words: the array formula -- in the singular -- crunches the multiple results and adds them all. The revised formula count: one replaces the original six.

And were we called upon to add up 50,000 rows’ worth of sales, the complement of array formulas we’d have to earmark for the task -- once you get the range references right, of course -- remains the selfsame, solitary one:

{=SUM(H2:H50001*I2:I50001)}

Powerful stuff, then, these array formulas. True, they take a bit of getting used to, forcing as they do a dramatic trimming of the conventional formula-writing script. After all -- arrays do all the heavy lifting in a single cell, but scaling their learning curve is worth the trip.

But what about those curly braces? That, er, brace of punctuation marks automatically surrounded pre-365 array formulas once they were inducted into their destination cells.

And tucking them into those cells obliged the user to do more than tap the standard Enter key; rather, what was called for was a curious, if legendary, triad of strokes: Ctrl-Shift-Enter, only after which would the formula barge into its cell, and the curly braces would clamp themselves around it. (The faux alternative wouldn’t work, by the way: the braces couldn’t merely be typed.)

What's new in Microsoft 365's Excel?

It was all rather quirky, perhaps, and all rather ancien régime, too; because with Microsoft 365 Excel came the dynamic array revolution, and with it a round of insurgent decrees, e.g.:

  • Ctrl-Shift-Enter has been abolished. Now, all Excel formulas, including the most abstruse array concoctions, are delivered into their cells via a familiar, no-frills press of the Enter key.

  • And that means the curly braces are gone, too, sheared from any and all array formulas.

  • And moreover -- and this is the heart of the matter -- a single dynamic array formula can spawn a range of multi-cell results, a possibility that is utterly new to Excel.

By way of exemplification, recall the sales calculation we stepped through a few paragraphs ago. I stated that the inaugural formula H2*I2 inlaid in cell J2 -- the formula that totalled the first transaction -- would then have to be copied down the J column, preparing each succeeding transaction to be figured in its row. But with dynamic arrays we need only enter this formula in J2:

=H2:H6*I2:I6

And nothing else. The above expression will multiply every number by every price and post each result in its appropriate cell in the J column. And for Excel, that’s unprecedented. This video below shows how that's done.

One formula, boundless results

What we’re seeing is what Excel calls a spill range, the stream of cells loosed down the worksheet by the array formula’s nascent, extended reach. The dynamic array capability empowers users to write exactly one formula that can propagate thousands, or even hundreds of thousands, of results down and/or across a like number of cells, should their data-analytic needs call for them. And they’re dynamic because, if you organise the data properly, the spill range will wax and wane as you add new records to the dataset, delete existing ones, or rewrite the formula (as you’ll see below).
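
One convenience worth flagging here, though it's not part of the walkthrough above: once a formula spills, Excel lets you reference its entire output via the spilled range operator, the # sign. If our multiplication formula lives in J2, a grand total requires nothing more than:

=SUM(J2#)

And because J2# always denotes the spill’s current extent, the total recalibrates automatically as the spill waxes and wanes.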

And in keeping with its remit, Excel has rolled out a line of new functions programmed to channel the dynamic array spirit, ones that free you to do some pretty cool and efficient things.

Putting dynamic array spirit in action: ECDC case study

Start with the UNIQUE function, which, befitting its name, winnows redundant data in a field to a set of single instances and collects them all in a spill range.

To appreciate UNIQUE’s utility and elegance, consider the spreadsheet managed and updated by the European Centre for Disease Prevention and Control that tracks the accumulation of COVID-19 cases by country and continent.

In excerpt, the data look like this:

If we set out towards the most self-evident objective -- namely, a summing of cases by country -- a problem immediately congests the route.

We want to see a column’s worth of country names, adjoined by a column of corresponding case totals (we’d tabulate these via a standard function such as SUMIF, embellished by a dynamic function fillip); but because the ECDC data records and remembers the figures for each week from January 2020, each country name recurs dozens of times, even as we want them put before us but once.

Combing the data with UNIQUE function

And that’s where UNIQUE comes in. Assuming we’ve range-named the Country column Country, all we need do is check into a cell and write therein:

=UNIQUE(Country)

We see, in excerpt:

UNIQUE combs the data for every country name and pares the occurrences of each to one. Again: one formula, a multi-cell result.
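
And to finish the job we set ourselves -- case totals alongside each country -- the SUMIF fillip mentioned earlier could look something like this (a sketch, assuming our UNIQUE formula spills down from cell K2 and that the case figures have been range-named Cases; both the cell and the name are mine, for illustration):

=SUMIF(Country,K2#,Cases)

Handed the spilled range as its criteria, SUMIF spills in turn, posting one total beside each country.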

Pretty neat -- literally; and if you don’t appreciate UNIQUE’s minimalist cunning, compare the formulaic equivalent offered by the pre-365 versions:

{=INDEX(Country,MATCH(0,COUNTIF(O$9:O9,Country),0))}

And you aren’t done there, either; once written, you’ll need to copy it down its column for as many country rows as you think you need, something you won’t know in advance.

But don’t be fooled just the same; elegance notwithstanding, UNIQUE won’t automatically sort the output it stacks down the column. The sort you see above is an artifact of the ECDC’s own, prior alphabetising of the source data; had the entries not been so arranged, UNIQUE would simply have presented them as they appeared from top to bottom among the records. Thus, if we were to apply UNIQUE to the data in the Continent column instead (range name Continent):

=UNIQUE(Continent)

The formula would drum up the following as shown in the video below:

Marrying SORT and UNIQUE functions

And that disordered outcome cries out for the SORT function, another arrow in the dynamic array quiver. If we wrap SORT around UNIQUE:

=SORT(UNIQUE(Continent))

As shown in the video below, you’ll find Africa lifted to the first position, and so on. (SORT also has a first cousin, SORTBY, that executes multi-column sorts.)
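
A quick, hypothetical sketch of that cousin: to reorder the full Country column by continent first and country name second -- both ascending, hence the 1s -- you could write:

=SORTBY(Country,Continent,1,Country,1)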

Pinpoint a data story with speed: ProPublica case study

By way of extended demonstration, we can also productively deploy UNIQUE at ProPublica’s compendium of civilian complaints levelled at New York City police officers, which in excerpt reads like the following screenshot:

The dataset archives over 33,000 complaints, but doesn’t answer this question: exactly how many individual, discrete officers were implicated by an aggrieved civilian?

The number is doubtless smaller, but that conjecture won’t suffice. So let’s reprise UNIQUE. Understanding that the first field in the data set, unique_mos_id, divulges the officers’ distinctive ids, we can enter, somewhere in the worksheet:

=COUNT(UNIQUE(unique_mos_id))

The virtues of COUNT

For the record, the count is 3,996, conducing towards a complaints-per-officer average of somewhat more than eight.

But if you’ve been clicking along with me you’ll note that, in spite of all of the multi-cell derring-do that UNIQUE’s been orchestrating to date, only one cell spills from the expression above. That’s because the venerable COUNT function gathers and confines all the unique ids into the innards of its formula, keeps them there, and counts them there.

COUNT is known as an aggregating function – it inhales a range of values and simply counts them. And a count, when reported, simply requires one cell. But make no mistake; the unique officer ids have been generated, nevertheless – they’re in there, holed up in the formula, where they’ve been counted by the single-minded, single-celled COUNT.
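
One caveat of my own: COUNT tallies numeric values only, which suits the numeric unique_mos_id codes. Had the ids been stored as text, you’d enlist COUNTA instead, which counts any non-empty value:

=COUNTA(UNIQUE(unique_mos_id))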

Drilling down the data with FILTER

But it’s when you’ve cleared out some space in your skillset for the dynamic array FILTER function that you’ll really be able to walk the walk with your data. FILTER does what it says on the tin: it carves out subsets of your data per a criterion or criteria you supply, and serves the results up with a facility that can’t be matched by Excel’s standard Filter tool, shown below:

A filter tool used in Excel appears above.

Just nail down a few of its none-too-daunting formula basics and FILTER will rise to the top of your go-to function list.

Two examples, both drawn from the ProPublica data, should affirm the virtues of FILTER: we could ask, for starters, about the number of complaints filed against women officers (we’re assuming throughout that all fields have been range-named after their headers, and that the entire dataset has been named All).

To avoid a messy conflation of FILTER with the source data, I’ll move into a new, pristine worksheet and enter, say in cell C4, the letter F for female (as it’s represented in the mos_gender field). Next, say in C10, I’ll write the following expression:

=FILTER(All,mos_gender=C4)

Translation: Extract, or filter, all those records in the dataset whose code in the mos_gender field reads F.

In excerpt, the results look like this in the video below:

I’ve coloured cell C10 yellow – the one and only cell in which I’ve written a formula – the better to highlight FILTER’s one formula/multiple-cell potency. The 1,760 records harvested here mean that a bit more than 5% of the civilian complaints arraigned women officers; but of course, if you substitute M for F in cell C4 you’ll trigger the 31,598 complaints brought against male officers instead.

Either way, it’s the same formula that’s promulgating these disparate outcomes -- and that’s what they mean by dynamic. It also means that if you delete that formula in C10, all the results it’s unleashed will disappear.
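
A small safeguard our example happens not to need, but messier datasets will: FILTER accepts an optional third argument stipulating what to display when no records match -- sparing you the #CALC! error it otherwise throws:

=FILTER(All,mos_gender=C4,"No matches found")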

And now for the follow-on question regarding complaints against New York City police officers: Exactly how many unique women officers incurred at least one complaint? We’ve already compiled the overall count of the 3,996 officers who were named, but because here we want to restrict the question by gender, the ensuing formula calls for a slightly more intricate expression. Continuing to reference gender in cell C4 -- again, the code F -- I can write:

=COUNT(UNIQUE(FILTER(unique_mos_id,mos_gender=C4)))

See the following video of the formula below:

This formula partners three functions -- FILTER, UNIQUE, and COUNT -- and works this way:

  1. It filters all 33,000 records by the gender code F as denoted in the mos_gender field;
  2. It then proceeds to sieve unique instances of each remaining unique_mos_id (all of whom are women);
  3. It then counts the unique ids that remain.

The result -- 387 actual women officers -- could then be subtracted from the global 3,996 officer count, leaving us with the 3,609 unique male officers who were likewise incriminated by complainants. But we could have applied our formula to that task as well -- by simply substituting an M for the F in cell C4. And if you’re ever-so-slightly daunted by the formula, try writing it in a pre-365 version of Excel.

But FILTER can do more, much more. For one thing, it can launch queries that plumb the data by multiple criteria. For example: suppose we wanted to view the records of complaints filtered both by gender and year, e.g. all complaints registered against male officers in 2013. If we retain cell C4 for the gender coding, enter M there, and key in 2013 in, say, C5, we could then write:

=FILTER(All,(mos_gender=C4)*(year_received=C5))

Note the new syntactical bits rushing into play here. When you nail several criteria into FILTER, each is enveloped by brackets (parentheses, in my part of the world), and they’re made to interact by interpolating an asterisk between them (what’s going on here beneath the formula’s bonnet is a storyline for The Sequel to this article). Get the jots and tittles right and you get the answer, something like this in excerpt:

Male officers, circa 2013, as noted in the following video:
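
One more syntactical aside of my own: swap the asterisk for a plus sign and the criteria combine as or rather than and. The variant below, for instance, would retrieve every complaint lodged against a male officer plus every complaint from 2013, whoever it named:

=FILTER(All,(mos_gender=C4)+(year_received=C5))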

Keep in mind that more complex permutations of FILTER will hand you the spade that’ll let you dig deeper into the data. True, the first encounter with Excel’s more baroque formulas, array or otherwise, can leave the user dazed and confused; but once you hone your spreadsheet chops you’ll be able to more vividly imagine, and liberate, the story possibilities imprisoned inside those cells -- and dynamic array formulas will help the journalist pry open the lock.

A native New Yorker, Abbott Katz currently lives in London. He has taught Excel and delivered training in diverse settings on both sides of the Atlantic, and has authored two books on the application (Apress). He has in addition contributed variously-themed pieces to New York Newsday, the Times Higher Education, insidehighered.com and other publications, and has a doctorate in sociology from SUNY Stony Brook.

A data journalist's guide to building a hypothesis https://datajournalism.com/read/longreads/hypothesis-data-journalism Fri, 04 Jun 2021 09:00:00 +0200 Eva Constantaras Anastasia Valeeva https://datajournalism.com/read/longreads/hypothesis-data-journalism Our next Conversations with Data podcast will take place on Tuesday 6 July at 3 pm CEST / 9 am ET with Eva Constantaras from Internews and Anastasia Valeeva from the American University of Central Asia, Kyrgyzstan. During our live Q&A, they'll discuss the power of building a hypothesis for data journalism and what can be done to address inequity with data. The conversation will be our second live event on our Discord Server. Share your questions with us live and be part of our Conversations with Data podcast. Add to LinkedIn or your Google calendar now.

Introduction

2020 pulled data journalism in two drastically different directions. On the one hand, the Black Lives Matter movement forced the data journalism community to question equity in the field: who is data journalism produced by, for and about? On the other hand, the pandemic offered a plethora of opportunities to channel the firehose of coronavirus data into shiny, often impersonal, dashboards of despair and death that quantified the scale of the pandemic.

The best data-led pieces of the year married these two trends into powerful investigations into the pervasive inequities laid bare by the pandemic, transforming statistics into concrete examples of specific harm to people that could be mitigated if addressed. One word describes these outstanding investigations: intentional.

The stakes for data journalism in the face of media polarisation, misinformation and disinformation are high as it struggles to find a role in the efforts to rebuild a healthy information ecosystem for citizens. As Lisa Charlotte Rost of Datawrapper asks in her blog post Less News, More Context, "With which information can my audience navigate this world better?”

Almost 10 years of teaching data journalism has taught us that the journalists who produce the most powerful investigations are the ones who started with a powerful idea -- one formulated as a hypothesis. This method, Story-Based Inquiry, pioneered by Mark Lee Hunter, has been adopted by many data journalists and refined further for data projects, for example, The Markup Method. For us, it enables journalists around the world to harness data to explore and explain the drivers of inequality undergirding the news of the day.

One hypothesis -- many stories

After reviewing dozens of nearly identical coronavirus dashboards, we ran across a submission for the 2020 Sigma Awards that suggested the journalists had dug into the data knowing what they were looking for. The entry on the disproportionate number of deaths among Black Brazilians, by Publica, a non-profit investigative outlet, led us to more stories published by Publica on racial disparities in vaccine distribution and access to ICU beds among indigenous communities.

Though the data behind the stories was available to readers, the focus was the story, not the data. Publica has built a data journalism beat around disparities in healthcare access, and a hypothesis-based approach allows them to drill deeper and deeper. They began with something like “Black Brazilians, who already scored low on an overall development index, are dying at faster rates than the general population” and then set out to see whether the hypothesis was true or not. Subsequent stories refined this hypothesis to probe further disparities in healthcare equity during the pandemic. The rest of this story explores how to apply this approach yourself.

Formulating a hypothesis

Let’s read a couple of stories and formulate their hypotheses as statements.

Perhaps you’ve come up with something like ‘Vaccine distribution is unequal’ or something more specific like ‘Vaccines are more available for high-income countries in general, and, on an individual level, for wealthy people, not the poor’.

They are both right. However, to be able to use a hypothesis as a tool for your own story, the second one works better. It formulates not only the idea, but also the means of proving it. This method is borrowed from social science, like a lot of data journalism techniques.

You don’t have to show this hypothesis-as-a-tool to your reader, but you do show it to your editor: it’s basically the pitch of the article. And since we want it to be convincing, it needs to be even more specific. What are the exact indicators that you will use to answer your questions? What is the unit of measurement? What time or geographic span are we looking at, and at what level of granularity? This is called the operationalisation process.

Let’s look at another story and formulate its hypothesis as a statement that is quite specific about the indicators.

You may have spotted that the text itself contains both the general idea (“the drop in employment is not gender-neutral”) and more specific statements which prove this idea, like this one: “The sectors most affected in the pandemic crisis -- restaurants, retail, beauty, tourism, education, domestic work, and care work for the young and elderly -- have high female employment”.

Let’s write out the basic requirements for a viable hypothesis using a sample hypothesis: “Socio-economically marginalised groups are more likely to die of the coronavirus”.

  • Can either be proven or disproven with data. For example, ‘Poor people are more likely to die of coronavirus than rich people.’
  • Is specific about what is being measured. ‘Citizens living in areas of the city with a lower annual income according to the latest census are dying at a higher rate than those living in richer neighbourhoods.’
  • The data is available. ‘Coronavirus death records and income data are available by neighbourhood.’
  • The topic is important to the public. ‘Inequity in healthcare access resonates universally.’

How to avoid common pitfalls

Now, let’s look at the common mistakes for hypotheses and how we can avoid them.

  • One half or both halves of the hypothesis cannot be proven with data. In many countries, neither geo-located death data nor geo-located income data is available. For example, in the case of Brazil, only race data was available, so the hypothesis had to focus on race by geographical area, not income.
  • The hypothesis is too fuzzy. The idea for a data story can often start from a broad, general idea like: ‘As the pandemic deepens, most EU countries become more pessimistic’. To make it work (and for anybody to care), you need to explain to yourself and to your audience what you mean exactly, how you will measure it and why it matters. In this Reuters story, the hypothesis may have been something like “Swedes and their pandemic policies were optimistic and open and they escaped the economic downturn that has spread across Europe”. Note, the story walks a fine line, presenting various correlations between attitude and economic indicators without making a causal claim.
  • The hypothesis is too broad. The topic is better for a book than a single story. Often, journalists try to tackle far too much in one story. It would take enormous time to explore all the variables that might influence the general problem. So why not focus on a specific aspect of your problem and explain it from A to Z? Instead of having a huge COVID data dashboard with lots of demographic data but no stories, drill down and identify specific, compelling stories that justify having a database. In the India job loss example above, the journalist has a hypothesis focussed on job loss related to the sectors where women are employed. This story pursues a related but distinct hypothesis: care work during the pandemic is forcing women out of the workforce.

Both of these reveal specific insights into barriers to economic recovery faced by women without getting lost in obvious generalisations about gender inequality.

  • The hypothesis is too narrow: it only measures how one factor influences a trend and discounts other data sources that might also contribute to it. Here is an example of how Rappler in the Philippines has dealt with the difficulty of identifying a pattern in the surge of coronavirus cases. While they start with a hypothesis about spikes in busy commercial areas, they also address the possible influence of other factors, such as concentrations of health and safety protocol violations.
  • The hypothesis has already been proven true and is common knowledge.

A lot of data journalists around the world have shied away from “the procurement process is corrupt” stories because of course it is! Instead, they use very narrow examples to pursue accountability on a local level. Pajhwok Afghan News’ data team pursued a hypothesis related to procurement price inflation for specific medical supplies. Dataphyte in Nigeria so aggressively pursued individual contracting irregularities that they forced the government to divulge more contract details.

The good news is that you can almost always make a weak hypothesis stronger by doing the research needed to make it more verifiable, specific, interesting and concise. Another piece of good news is that even if you prove your hypothesis false, what you did find is probably still a compelling, and maybe even a more surprising, story.

From hypothesis to questions

And now let’s dive a little deeper. The hypothesis-driven approach also lends itself well to developing research questions to prove your hypothesis true or false. Sticking with research questions that probe your hypothesis serves the same purpose as writing out interview questions for a difficult source ahead of time: it allows you to organise your thoughts and ensure you get the answers you need.

Let’s read this data story and pull out the major findings. Then we will reverse engineer the hypothesis and questions:

If we list the data arguments in this piece, we can get something like this:

  • The majority of Indigenous Lands (TIs) in the Amazon have been identified as in critical condition due to the coronavirus pandemic in Brazil.
  • Of 1,228 Brazilian municipalities where there is at least a stretch of TIs, only 108 have an ICU bed, so less than 10% of Brazilian municipalities with indigenous lands have ICU beds.
  • More than 80% of all TI lands in the country are concentrated in the North, precisely the region that, along with the Northeast, has the largest ICU deserts in the country.
  • The maternal mortality rate for indigenous people is highest among all races, even when controlling for socioeconomic level. Deaths in the indigenous community are undercounted.
  • Among the 10 regions that have been identified as most vulnerable to the coronavirus, seven haven’t been officially recognised for protected indigenous status.
  • About four out of five households in indigenous territories did not have a water supply and a third of households on indigenous lands did not have a bathroom for exclusive use.
  • In 17 TIs, at least one-fifth of the population was over 50 years of age, which is considered a risk factor for coronavirus.
  • Researchers have called for the establishment of specific strategies for the care of indigenous peoples.
  • Another recommended solution is the construction of field hospitals exclusively for indigenous people.

From this list of answers, we can reverse engineer a hypothesis and a list of questions:

Hypothesis

  • Indigenous communities are facing an acute health crisis during the pandemic due to under-resourced health facilities and underlying health conditions.

Problem

  • Are indigenous communities dying at a disproportionately high rate?

  • Do indigenous communities have worse access to ICU beds than the rest of the country?

Impact

  • What proportion of indigenous lands are considered in critical condition now?

  • Are indigenous communities considered to be in a more critical condition during the pandemic than the rest of the country?

  • What proportion of the population of indigenous communities is considered high risk?

Cause

  • How did maternal mortality rates of indigenous people compare to the general population before the pandemic?

  • How did access to clean water in indigenous communities compare to the rest of the population before the pandemic?

  • How complete are death records in indigenous territories, compared to the rest of the country?

  • How complete are death records among indigenous communities?

  • How complete is the registration of Indigenous Territories?

Solution

  • What strategy can be employed to close the gap in access to healthcare and mitigate the vulnerability of indigenous people?

We can see these questions touch on different parts of the problem. While some describe the scale of the problem, others focus on the impact of the problem on a particular group of people, and others dive into the causes and factors behind that. Finally, there are questions about the possible solutions or ways to mitigate these consequences.

You can apply this general list of questions to nearly every data story that dives into the roots of a problem and aims to build a concise narrative around it:

Problem:

  • How big is the problem?
  • Is it getting worse or better?

Impact:

  • Which category of people is more likely to experience the consequences of the problem/benefit from the situation?
  • How does the problem affect this group of people?

Cause:

  • What are the main causes explaining why the problem is disproportionately affecting these people?
  • Which factors have contributed to this?

Solution:

  • What needs to be fixed for the impacted group of people to mitigate the consequences or solve the problem for them?
  • How much would it cost and is there a source of money for this?
  • Has anybody already tried to solve this problem, here or elsewhere?
  • How can we measure the effectiveness?

These questions help the story remain focused on the specific hypothesis that the journalists have set out to prove or disprove. The questions ensure they drill deep into the issue and explain the problem from various angles using data. A great data hypothesis breaks down into questions that can be answered with data to prove or disprove it.

Conclusion

A good hypothesis can be proven with data that exists and generates new insights into an issue. It also measures the problem, causes, impact and solutions.

A hypothesis is a great way to build up beat reporting around an issue your audience cares about. For example, check out these variations of the previous hypothesis:

  • Indigenous communities are facing an acute economic crisis during the pandemic due to under-resourced economic recovery programmes and chronic lack of local investment.
  • Indigenous communities are facing an acute education crisis during the pandemic due to an under-resourced education system and chronic lack of access to the internet.

Many favourite issues covered by data journalists -- politics, healthcare, education, the economy -- are universal. Reading how other data journalists explore and explain these issues is a way to find inspiration to generate meaningful stories about and for your community and help communities make sense of pressing issues like inequity. Adopting a hypothesis-driven methodology establishes a workflow to build data-driven beat reporting around complex, often misunderstood problems that are not going away anytime soon and require meaningful and informed citizen engagement to change the status quo.

Eva Constantaras is a data journalist specialised in building data journalism teams in the Global South. These teams have reported from across Asia, the Middle East, Latin America and Africa on topics ranging from broken foreign aid and food insecurity to extractive industries and public health. As a Google Data Journalism Scholar and a Fulbright Fellow, she developed a pedagogical approach and manual for teaching investigative and data journalism in high-risk environments. Follow her on Twitter: @evaconstantaras

Anastasia Valeeva is a data journalism trainer and open data researcher. She has taught data journalism in Europe, the Balkans, Central Asia and Russia and is currently a data journalism lecturer at the American University of Central Asia, Kyrgyzstan. She is also a co-founder of School of Data Kyrgyzstan. She has researched the use of open data in investigative journalism as part of her fellowship at the Reuters Institute for the Study of Journalism, Oxford. Follow her on Twitter: @anastasiajourno

Conflict reporting with data: a guide for journalists https://datajournalism.com/read/longreads/conflict-reporting-with-data Mon, 03 May 2021 16:30:00 +0200 Sherry Ricchiardi https://datajournalism.com/read/longreads/conflict-reporting-with-data For years, Syria’s dark dungeons have functioned as hellholes of torture, starvation and murder. Thousands have vanished into this vast network, never to be heard from again; many emerged maimed and broken.

In a stunning investigation, The New York Times laid bare the sadistic violence and put on record war crimes committed by President Bashar al-Assad’s regime in these halls of horror.

Data on the dead and missing was a driving force behind the story by Anne Barnard, a former Times’ Beirut bureau chief and veteran of covering the armed conflict.

“It is almost impossible to do justice to the depth of Barnard’s reporting and the evil it described,” wrote The New Yorker’s Isaac Chotiner when the story was published in May 2019. During a Q&A with the author, he asked how “astonishingly bleak figures” of torture and murder had been gathered.

Barnard reported that “Nearly 128,000 have never emerged, and are presumed to be either dead or still in custody, according to the Syrian Network for Human Rights, an independent monitoring group that keeps the most rigorous tally. Nearly 14,000 were ‘killed under torture.’”


In a New York Times’ Insider piece about the investigation, she noted, “We doubled our efforts to cover the story, as human rights groups steadily compiled data on dozens of torture facilities, tens of thousands of disappeared Syrians and thousands of executions of civilian oppositionists after sham trials.”

Her team spent weeks in Turkey, Germany and Lebanon listening to survivors’ recollections and cross-referencing them. When she began covering Syria in 2012, it was a different scenario.

“We focussed on visible war crimes – ones we witnessed in person or quickly verified through witnesses and videos . . . By contrast, detention, torture and execution were unfolding unseen in secret dungeons, recorded mainly in the minds of survivors,” Barnard wrote in the Times.

Above is an image of the destroyed Homs city centre in Syria. The country's third-largest city was a key battleground in the uprising against Bashar al-Assad.

Data journalism played a vital role in exposing Syria’s atrocities through the use of advanced digital forensics, geolocation and data visualisation, among other high-tech tools that improve the accuracy and impact of war reporting.

Newsworthy statistics on conflict are widely available, but here is the challenge: How do journalists identify the right datasets to use? How do they evaluate data sources among the many out there? What should they be looking for?

“More data is not necessarily better data. We need to know where it is coming from, what is included and what is not. Reporters should not take all data as unbiased facts,” said Andreas Foro Tollefsen, senior researcher for the Peace Research Institute, Oslo, a main player in the conflict data field.

Conflict Reporting With Data

Reviewing some of the producers of conflict event data sets and their specialties helps narrow the search. Following are thumbnail sketches of three large-scale data-collection projects often cited in media reports and scholarly studies. Also included are examples of how the media use these data sets to report on armed conflict and how these groups collaborate.

Uppsala Conflict Data Program (UCDP): Data sets on conflict and peacekeeping including peace agreements, intrastate armed conflict, non-state conflict, one-sided violence, and conflict termination. UCDP offers datasets on organised violence and peacemaking, which can be downloaded for free through the UCDP downloads website. Illustrative charts, graphs, and maps are also available.

Armed Conflict Location & Event Data Project (ACLED): Described as a disaggregated data collection, analysis and crisis mapping platform. Collects real-time data on the locations, dates, actors, fatalities, and types of all reported political violence and protest events across the globe. Users can explore data with an interactive dashboard. Operates under the slogan, “Bringing clarity to crisis.”

Peace Research Institute, Oslo (PRIO): Explores how conflicts erupt and can be resolved; investigates how different kinds of violence affect people and examines how societies tackle crises. Data projects include collecting conflict start and end dates to aid in the study of the duration of violence and adding figures for yearly combat deaths. Active research projects are listed alphabetically and include dozens of topics.


The data these groups collect is impressive and, at times, daunting. How does it translate into real-world news reporting? What kind of stories does it help to tell?

“Journalists equipped with data and empirics have a very powerful instrument to enlighten, but also to ask decision-makers the right questions,” said PRIO’s Tollefsen, a self-described human geographer. “Looking at data often challenges narratives or generates inquiries that wouldn't be evident without it.”

For instance, data can show trends, maps, and patterns, highlighting whether violence has gone up or down in a region, where conflict is located, and how this relates to conditions on the ground that impact civilians, such as migration or refugees. A story based on ACLED data showed how conflict in northern Mozambique displaced over half a million people.

PRIO’s data was used in combination with other sources in a recent Save the Children report, “Weapons of War: Sexual violence against children in conflict.” Statistics showed that nearly 10 times more children are at risk now than three decades ago when the toll was 8.5 million.

Save the Children's report showed that 72 million children live 50 kilometres or closer to conflicts where armed groups have perpetrated sexual violence against children.

Syria, Colombia, Iraq, Somalia, South Sudan, and Yemen were identified as countries where children are at greatest risk of sexual violence, including rape, sexual slavery, forced prostitution and pregnancy, sexual mutilation, and sexual torture at the hands of armed groups, government forces, and law enforcement.

“Data is a powerful tool and challenges our understanding of the world,” said Tollefsen whose work focuses on the use of geospatial data and Geographic Information Systems for research on the cause and consequences of conflict.

In another collaborative effort, PRIO examined trends in Middle East conflicts between 1989 and 2019 and compared them to global trends, using UCDP data. Researchers analysed conflict recurrence, ceasefires, and peace agreements during the same period.

The study found that over the past 10 years, the majority of the world’s deadliest conflicts have been in the Middle East, with Syria being the deadliest.

Data also plays a pivotal role in breaking news, such as the impact of COVID-19. As the pandemic spread, data journalism provided vital, reliable information through the use of interactive maps, graphics and charts marking cases and deaths.

The Washington Post used ACLED data to explore: “Does COVID-19 raise the risk of violent conflict?” The number of conflict events was tracked over time to see whether trends changed after the World Health Organization declared a global pandemic in March 2020, or after individual countries declared lockdowns.

In the story, the Post defined ACLED as “A database that counts the number of conflict events daily around the world. For 2019 and 2020, ACLED includes more than 100 countries in Africa, Asia, Latin America and Eastern Europe — and tracks three categories of violent conflict: battles, violence against civilians and explosions/remote violence.”

A screenshot of the Washington Post article analysing how the COVID-19 pandemic has impacted the risk of violent conflict.

A June 2020 article that appeared in Johannesburg’s Mail and Guardian examined how the pandemic had shifted patterns of conflict in Africa.

“Tracking activity over the past 10 weeks, ACLED found that conflict rates held steady across the continent, but patterns of violence have shifted as armed groups and governments take advantage of the pandemic to make moves on political priorities,” wrote ACLED’s director, Clionadh Raleigh.

A detailed infographic accompanied her story.

Finding appropriate sources of conflict data is the first step. Interpreting the information correctly can be challenging. Following are tips from experts on how to analyse and evaluate research for accuracy, reliability and fairness, the hallmarks of quality reporting.

Evaluating conflict data

Harvard University research fellow Kelly M. Greenhill co-authored a book “Sex, Drugs, and Body Counts: The Politics of Numbers in Global Crime and Conflict” exploring the misuse of conflict statistics and showing that miscalculations can have “perverse and counterproductive consequences.”

Getting it wrong might, for instance, help prolong wars, give governments an excuse not to act or muddy evaluations of policy successes or failures in conflict zones.

What can journalists do to get it right? Greenhill provided the following questions that should be routinely asked about crime and conflict data:

  1. What are the sources of the numbers?
  2. What definitions are the sources employing, and what exactly is being measured?
  3. What are the interests of those providing the numbers?
  4. What do these actors stand to gain or lose if the statistics in question are – or are not – embraced and accepted?
  5. What methodologies were employed in acquiring the numbers?
  6. Do potentially competing figures exist, and, if so, what is known about their sources, measurement and methodologies?

“To some degree, the politicisation of statistics is inevitable and unavoidable. But journalists as consumers and disseminators of statistics should be more savvy and credulous in their acquisition and utilisation of these data,” said Greenhill, a political science and international relations professor at Tufts University.

She is concerned journalists could inadvertently serve as amplifiers of politically motivated statistical distortions leading to “counter-productive policy outcomes.”

Greenhill has been a consultant to the World Bank, U.N. High Commissioner for Refugees, and an analyst for the U.S. Department of Defense. She is working on a new book exploring the influence of rumours, conspiracy theories, propaganda, so-called "fake news" and other forms of extra-factual information on international politics.


Comparing Datasets

Assessing similarities and differences in datasets is another way to evaluate how useful they might be for conflict coverage.

Roudabeh Kishi, ACLED’s director of research and innovation, co-authored a report comparing conflict data from several well-known sources, including the Global Terrorism Database (GTD), Integrated Crisis Early Warning System (ICEWS), the Phoenix event dataset, and the Uppsala Conflict Data Program Georeferenced Event Dataset (UCDP GED).

The stated purpose of the study: “To demonstrate how the collection [of event data] mandates, coding rules, and sourcing methods can result in drastically different information on political violence and interpretations of conflict.” Researchers also looked at human-generated versus automated data collection.

What should journalists look for? Among indicators on the veracity of data, the report listed:

  1. Sourcing: “Extensive sourcing – including from local partners and media in local languages – provides the most thorough and accurate information on political violence and demonstrations, as well as the most accurate presentation of the risks that civilians experience in their homes and communities.”
  2. Transparency: “Data sets must be usable if they are to be relied upon for regular analysis and users should be able to access every detail of how conflict data are coded and collected.”
  3. Coverage and classification: “Clear, coherent, and correct classification is important for users because conflicts are not homogenous: disorder events differ in their frequency, sequences, and intensity.”

“Some people say, ‘Well, data is data and that’s it.’ It’s important to remember these are different projects with different methodologies, definitions, and usually based on a specific mandate,” Kishi explained. “Pay attention to what sources the datasets are using. That makes a really big difference.”

She advises beginners to learn simple software such as Tableau, an analytic tool that can help create maps and charts or filter data by actor or type of violence. There is a training video on ACLED’s website.


Kishi provided Tableau training for the Syrian Network for Human Rights (SNHR), one of ACLED’s partners. “They are great. You’ll see them in our data,” Kishi said of the Syrian human rights group that supplies data based on actual counts of reports, rather than extrapolations or estimates.

Since its beginning in 2011, the group has amassed a vast archive of eyewitnesses’ names, contact information, and testimonies, as well as the photos and videos that are meticulously preserved as background for scholarly research, use in news stories or for future war crimes trials.

“Supplying statistics to media is part of our advocacy,” said Fadel Abdul Ghany, the group’s founder and director. He oversees a volunteer corps of 35 mostly in Syria, but also living in Turkey, Jordan and Lebanon where they have taken refuge.

SNHR’s data has been a key source for the U.N. High Commissioner for Human Rights and appears in U.S. State Department reports on Syria. Monthly reports document recent death tolls, assassinations and attacks against civilians. In January there was information about the spread of COVID-19.

Abdul Ghany’s goal is to keep Syria in the spotlight. “Stories in top media like The New York Times, Washington Post, and CNN draw the attention of others,” he said from his home in Qatar. “We are always ready to help journalists. Send us an email, tell us what you are working on and let us know how we can cooperate.”

Other Resources That Can Help

ACLED’s Ten Conflicts to Worry About in 2021: Includes Myanmar, Belarus, Yemen, Ethiopia, India and Pakistan. ACLED advises accessing data directly through the “export tool” and finding information about methodology under “Resource Library.” A video walks users through the data collection process.

The UCDP’s Conflict Encyclopedia: Describes itself as a “main provider of data on organized violence and the oldest ongoing data collection project for civil war, with a history of almost 40 years.” Offers a web-based system for visualizing, handling and downloading data, including ready-made datasets on organized violence and peacemaking, free of charge.

Stockholm International Peace Research Institute: Provides data on peace operations conducted since 2000, military expenditures, arms transfers and embargoes.

George Mason University Libraries InfoGuides: Resources for research in conflict resolution, peace operations, armed conflict, and human security. Includes link for peacebuilding, human security and terrorism datasets.

Amnesty International: Offers two free courses on Open Source Investigations in four languages: Arabic, Persian, English and Spanish. Courses act as guides to using open source research methods in practice, with a focus on human rights investigation and advocacy. Instruction covers cutting-edge tools and techniques taught by experts, alongside practical exercises.

Global Terrorism Database, University of Maryland: Bills itself as the “most comprehensive unclassified database of terrorist attacks in the world.” Includes more than 200,000 terrorist attacks dating back to 1970.

WomanStats: A “comprehensive compilation of information on the status of women in the world.” It combs the extant literature and conducts interviews to find qualitative and quantitative information on over 310 indicators of women's status in 174 countries.

Commission for International Justice and Accountability: Stated purpose, “Achieving justice for crimes that impact vulnerable populations across the globe, including war crimes, crimes against humanity, genocide, terrorism, human trafficking, and migrant smuggling.” Works to support prosecutions in 13 countries and assists 37 law enforcement and counter-terrorism organisations globally.

Syrian Observatory for Human Rights: Monitors political, military and humanitarian developments in Syria with a network of sources inside the country and internationally. Reports appear on the SOHR website, Facebook and Twitter and are often cited by major news outlets and other rights organisations.

Data visualisation by hand: drawing data for your next story https://datajournalism.com/read/longreads/data-visualisation-by-hand Wed, 24 Mar 2021 09:30:00 +0100 Amelia McNamara https://datajournalism.com/read/longreads/data-visualisation-by-hand We live in a world where data visualisations are done through intricate code and graphic design. From Tableau and Datawrapper to Python and R, numerous possibilities exist for visualising compelling stories. But in the beginning, all data visualisation was done by hand. Visualisation pioneers like W. E. B. Du Bois and Florence Nightingale hand-drew their visualisations because there was simply no other way to make them.

Above shows "Diagram of the causes of mortality in the army in the East" by Florence Nightingale, 1858 via Wikimedia Commons Florence Nightingale, the woman who revolutionised nursing was also a mathematician who knew the power of visualising information with hand-drawn images.

For Du Bois it was his team of black sociologists who explained institutionalised racism to the world using data visualisations, while for Nightingale it was her diagram showing the causes of mortality.

Close up on Atlanta University's "City and Rural Population. 1890" data visualisation by W. E. B. Du Bois (Source: Public Domain).

And, even as computers developed, it was often easier to visualise using analogue means. This article will explore the history of hand-drawn visualisations and the case for presenting them in this style. It will also show examples from experts who have opted for the pencil over the screen. You'll also learn some top tips to help get you started.

A short history

When Mary Eleanor Spear wrote her pioneering visualisation book Charting Statistics in 1952, she emphasised graphics that could be easily hand-drawn. For example, the 'range bar chart' (a predecessor of the boxplot) is a simple summary graphic for one numeric variable -- a relatively simple visual to create without a computer.

Unlike a histogram, which would require deciding on breakpoints a priori and counting the number of cases that fall into each bin, a boxplot relies only on a few summary statistics. The analyst would calculate the median, the quartiles, and account for outliers, then get out their ruler and pencil to draw the visualisation.

The above shows two pages from "Charting Statistics" discussing range bars. The book was written by Mary Eleanor Spear, visual information specialist at the U.S. Bureau of Labor Statistics, Graphic Consultant and Illustrator, and Lecturer at The American University.

John Tukey popularised these ideas in his 1977 book Exploratory Data Analysis, where he also emphasised that graphics could be easily hand-drawn.

The idea of Exploratory Data Analysis (now commonly abbreviated EDA) is to compute summary statistics and make basic data visualisations to understand a dataset before moving forward. Every graphic in the EDA book was made by hand by Tukey, although he was so precise that they can be mistaken for computer-generated charts.

Tukey's EDA book includes advice on the materials an analyst would need, like tracing paper, which he says allows the analyst to use graph paper as a guide while leaving the final visualisation clearer.

Jacques Bertin, another visualisation pioneer, was also focused on making a data analyst as effective as possible without a computer. One of his strategies was to create a ‘Bertin matrix’, a physical representation of an entire dataset, which could be reordered by the use of long skewers stuck through it. His graduate students would work to find an ordering that showed structure in the data, then photocopy the physical matrix to retain a version of the data before moving on.

Above is an example of a Bertin matrix.

Is digital better? A case for hand-drawing visuals

So, handmade data visualisation is not something new. In fact, it is the original form of visualisation! But, as computer tools have evolved to make it easier to create data visualisations, more and more visuals are ‘born digital’. That doesn't mean the need for handmade visualisation has disappeared, or that computer-generated graphics are better. There are several reasons why I advocate for journalists to experiment with hand-drawing visuals:

  1. It gets you thinking outside the box. If you are someone who is adept at using computer tools to generate visualisations, you may only think of the visual forms most easily generated by your tool.

  2. A handmade visualisation can lend a feeling of friendliness to a story. Quite often, computer-generated visualisations feel sterile and can be inaccessible to certain audiences.

  3. Handmade visualisations feel less ‘truthy’, so they can be a great way to convey uncertainty.

  4. Making visualisations by hand is a concrete way to learn the way that data values are coupled to visual marks.

  5. It's fun!

Sometimes a handmade visualisation is a product you make for yourself, to help you brainstorm, understand your data, or just as a creative outlet. Other times, a handmade data visualisation can become your final product, published for others to read and experience.

And there are many handmade visualisations to draw inspiration from. The book Infographic Designers' Sketchbooks is filled with behind-the-scenes looks at how visualisations began their lives. While some of the authors do their sketching in code, the vast majority begin by drawing on paper. So, hand-drawn visualisation can also be a step on the way to something computer-generated.

Perhaps more interesting are the hand-drawn visualisations that end up getting published in one way or another. In the category of personal visualisation, the project Dear Data, by Giorgia Lupi and Stefanie Posavec is a prime example.


Lupi and Posavec are both professional designers, and their client work (typically computer-generated) can be seen in a variety of contexts.

For Dear Data, they took another approach. Every week for a year, they each collected data on an agreed-upon topic about their lives (like laughter, doors, or complaints) and generated a hand-drawn visualisation of that data on a postcard. They mailed the postcards transatlantically to one another.

While data visualisations often aim to accurately convey information to a reader, that wasn't the goal for Lupi and Posavec.

Instead, they wanted to convey some sense of their lives to one another. Readers aren't asked to decipher the precise values they put on the page, but rather to draw inspiration from beautiful forms, and enjoy what is closer to a narrative or memoir of the authors’ lives.

There are other data artists who produce work in this space, like Nicholas Felton, who for years produced the annual ‘Feltron Report’.

People who bought the Feltron Report weren't doing so in order to learn something new about the world, but to appreciate Felton's work. Again, the reports were works of art, like an autobiography in data visualisation.

What the experts say

Research on data visualisation often focuses on how effective a visualisation is at conveying the precise information it encodes. In 1984, William Cleveland and Robert McGill published a paper called Graphical perception: Theory, experimentation, and application to the development of graphical methods.

This paper (cited more than 1,600 times!) outlined the results of their experiments on graphical perception. If you have heard arguments for the use of bar charts instead of pie charts, the data likely came from this 1984 study.

Their study showed how bad people are at judging areas (of circles or other shapes) and cautioned against the use of area as a method for graphical encoding. IEEE Vis, a professional community and conference for computer scientists studying visualisation, continues to publish papers along these lines.

For example, the paper Ranking Visualizations of Correlation Using Weber's Law demonstrated which data visualisations made it easiest for readers to assess correlation visually.

However, the goal of visualisation does not always have to be to encode information in such a way that it is easy to read off exact values. Often, the most important thing is to give a truthful impression of the data. And, the most technically correct visualisations may not always be the best way to convey that impression.

One important component of visualisation is attention -- a person can't read and understand a visual unless they pay attention to it. Visualisation critic Edward Tufte often advocates for the simplest possible visualisation, by maximising the data-ink ratio.

Darkhorse Analytics produced a gif example of what this process could look like. In many cases, it is better to reduce the amount of visual clutter and non-data ink, but other times it seems Tufte takes this too far, such as his redesign of the boxplot that ends up as a broken line with a dot in the centre.

Data visualisation expert Nathan Yau advocates for what he calls ‘whizbang’. Whizbang is the cool factor (often animation or interaction) that draws people into your visualisations. In a world filled with digitally-generated visualisations, a handmade visualisation might be just the whizbang you need to draw readers in.

Data journalist Mona Chalabi has embraced this idea, creating many hand-drawn visualisations that are published as finished pieces in The Guardian and elsewhere.

Chalabi is the data editor at The Guardian, so she knows the ‘rules’ of data visualisation. But she also understands when it makes sense to bend or break them. Her OpenVisConf talk, Informing without Alienating, discusses her philosophy of making graphics that inform as many people as possible.

Chalabi considers the context in which her visualisations will be seen. She also draws her visualisations using familiar objects, to help readers understand things like units. For example, she created a visualisation to answer the question ‘How much pee is a lot of pee?’ using common soda bottle sizes:

In another piece, she showed sugar consumption over time in the US and the UK, using sugar sprinkles.

Beyond the familiarity granted by the objects Chalabi draws, the hand-drawn nature of her visualisations makes them feel less precise. Again, this is her intention. Numbers and computer-generated visualisations often ‘feel’ true, but there is always some amount of uncertainty that surrounds them.

By drawing her visualisations, Chalabi is able to convey some amount of variability. When you look at her visualisation of how much air pollution is emitted and inhaled by people of different races, you won't be able to read off the exact numbers. (In fact, you can't read off numbers at all -- the chart does not have labeled axes!) Instead, you will be able to see which group's share is largest, and by how much.

Chalabi is working from her gut, but researchers at Bucknell University have begun to study how different groups interpret data visualisations differently. So far they have focused on a particular rural population, but you can imagine how this work could be extended to other subgroups. One of their key findings is that visualisations are personal.

Often, we imagine that we can generate an idealised representation of a particular dataset, but everyone comes to our visualisations with their own identity and prior beliefs. For certain groups, a visualisation may work well, and for others not at all.

Bucknell’s researchers point out that many of the field’s historic studies on perception relied upon a homogeneous group of people (often, college students, who tend to be whiter, richer, and, of course, more educated than the general population).

Hand-drawn visualisation in practice: an exercise to get started

Hopefully, by now I've convinced you there are many benefits to drawing your visualisations by hand. But how do you actually begin doing it? My main recommendation is to just put pencil to paper and start. Many people believe they can't draw, but almost everyone drew as a child, and simply got out of the habit. As with other skills, the more you practice the better you get.

You don't need any particular materials to draw visualisations. A ballpoint pen and a piece of paper will do in a pinch, but having a few colours makes things more fun. When I do the following exercise with students, I encourage them to bring along any materials they have lying around (I'd guess you have some fun art supplies tucked into a corner of your desk somewhere) and I also show up with some more things to share.

All sorts of craft materials can help you create data visualisations by hand.

Again, these supplies don't have to be expensive. I bought mine at a local surplus store and spent less than $50 on a bag full to bursting that I've used for several sessions already. Some supplies to consider:

  • coloured construction paper
  • scissors
  • markers
  • glue
  • rulers.

Beyond these standard school supplies, I've brought along balloons, pom-poms, watercolour paints, wire, string, and other ephemera. I find that having lots of stuff to play with opens up creativity. If you want to get fancy, the supplies suggested by Sketch-a-Day are a good start.

Sketching, ideation, and iteration are key elements to design thinking, another area from which I draw inspiration. Artists and designers regularly generate many sketches before settling on a final idea to flesh out. If you begin by making (say) four sketches of a visualisation before picking a particular form, you may find that your later ideas are much stronger than your first one.

Again, working by hand makes it easier to quickly generate a few ideas. If you want to stretch yourself, try making 10 sketches before you begin, or take inspiration from Nathan Yau and generate 25 different possibilities.

Do this quickly! Spend 30 seconds to a minute on each sketch, so you don't get too bound to a particular idea. Remember, you can come back to the one you love after the sketching period ends. If you run out of ideas, it's okay to draw subtle variations on an idea you want to explore, but aim for sketches that are as different as possible. Again, this is a good way to exercise your creativity.

If you are working with others, consider holding a ‘design charette’, explained by Kara Pernice, from the Nielsen Norman Group, as "a short, collaborative meeting during which members of a team quickly collaborate and sketch designs to explore and share a broad diversity of design ideas".

After you have quickly sketched a few ideas, spend a minute assessing them. Which one do you want to explore further? Now you can begin work on your ‘finished’ product (which may just be for you or could be aimed at future readers). Consider whether you want to use tools like a ruler or a compass to make your work more precise, or if you want to leave it fluid.

Think about your use of colour. Without the computer, you don't have any pre-defined colour palettes to choose from, so you may need to be more intentional. On the other hand, maybe you only have five coloured markers, so those will be your colour choices.

Try this exercise first with a very small dataset. I recommend fewer than 10 rows, but at least two variables. Constraints breed creativity, so having to focus on just a few values will make you stretch more for your set of ideas.

Also, if you need to make a mark for each observation in a dataset, remember that the larger the dataset, the longer it takes! Some handmade visualisations capture many values beautifully, but that can require a more advanced skill.

When I have done this exercise with college students, I’ve used that college’s racial demographic data, published by the Office of Institutional Research. This dataset leads to questions of categorisation (do they let students select multiple races, or do they simply select ‘multiracial’? Why are international students categorised separately?), which made their way into my students' work.

At SRCCON 2018 in Minneapolis, I brought data about the city’s lakes. If you are looking for small datasets, Wikipedia can be a good resource. As you get more experienced visualising by hand, you may want to try larger datasets. That's fine! In fact, I think using a combination of computer tools and art supplies is an interesting marriage. You can use the computer to summarise the data for you, but then do the mark-making on paper.
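For example, here is a minimal sketch of that division of labour in Python; the file and column names are invented for illustration, with pandas doing the summarising before the mark-making happens on paper.

```python
# Let the computer reduce a dataset to a hand-drawable summary
# (fewer than 10 rows), then draw the marks yourself.
# "minneapolis_lakes.csv" and its columns are hypothetical.
import pandas as pd

lakes = pd.read_csv("minneapolis_lakes.csv")
summary = (lakes.groupby("lake")["depth_m"]
                .agg(["mean", "max"])   # two variables per lake
                .round(1)
                .head(8))               # small enough to draw by hand
print(summary)
```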

If you want to create a precise visualisation, but also give it a handmade feel, take inspiration from Tukey and use tracing paper to trace a computer-generated graphic. This can also be useful when you want to draw something which might be beyond your artistic skills (say, an outline of the United States).

Student work from Dr Amelia McNamara's undergraduate data visualisation course SDS 136: Communicating with Data.

Throughout this piece, I have been using the terms hand-drawn and handmade interchangeably. You will likely want a two-dimensional product if you are sharing your visualisation electronically, but if you have the flexibility for something three-dimensional, try adding in more of the craft supplies I mentioned above.

At SRCCON, participants played with the dimensionality of Minneapolis' lakes, creating 3D visualisations out of paper and wire.

You could go even further, by doing data physicalisation or data visceralisation, to engage more of the senses.

Work like Data Cuisine by Moritz Stefaner and Susanne Jaschko, or Eat the data, smell the data from the School for Poetic Computation take the idea of ‘visualisation’ into other dimensions.

Conclusion

No matter how you decide to work, the practice of hand-visualising can bring you closer to your data. In Numbers in the Newsroom, Sarah Cohen suggests you should “memorize some common numbers on your beat”.

What better way to familiarise yourself with those numbers than physically mapping them on paper? This is also a great pedagogical practice, because when students begin visualising data they can get caught up in the technical details of a computer tool, without grasping the underlying connection between data and visual. Removing the computer makes it more concrete.

Of course, I'm not the only one who thinks this. Stefanie Posavec, of Dear Data, leads workshops on handmade visualisations, and the companion book Observe, Collect, Draw! is a guide to help you do just that.

I have also drawn inspiration from Jose Duarte, whose Handmade dataviz kit provides a good starter pack of materials to jumpstart your creation process. Duarte's work is again physical and playful, and he encourages everyone to try it for themselves.

Whatever you do, I believe bringing handmade data visualisation into your practice can lead to more whimsy and fun in your life. Even if your hand-drawn visualisations never make it to print, they can help you think creatively about what you produce digitally. And, in the right context, a handmade visualisation could make your data more accessible to a broad audience.

Dr. Amelia McNamara is an assistant professor of statistics in the department of Computer & Information Sciences at the University of St Thomas, in Minnesota. Previously, she was a visiting assistant professor at Smith College, in Massachusetts. Alongside standard statistics classes, she teaches classes on data journalism, data visualisation, and general data communication. She is an international keynote speaker and researcher at the intersection of statistics education and statistical computing, with the goal of making it easier for everyone to learn and do data science.

Making numbers louder: telling data stories with sound https://datajournalism.com/read/longreads/data-sonification Wed, 10 Mar 2021 07:30:00 +0100 Duncan Geere Miriam Quick https://datajournalism.com/read/longreads/data-sonification For years now, the gap between traditional reporting and data journalism has been shrinking. Today, statistics and figures are a common sight in reporting on almost any issue -- from local politics, to health, to crime, to education, to arts and culture.

But over-reliance on data can sometimes lead reporters to neglect the human angle of their stories. Without a direct link to people’s lived experiences, a story will feel flat and characterless. This is most often solved by bringing in photography, video or perhaps an interview. But over the past year, we’ve been exploring another approach.

Since before the beginning of recorded history, people have been telling stories with music. Music uplifts us, consoles us, excites us, scares us, and energises us. Like the best journalism, it speaks to the heart as well as the head. We’ve spent a large part of our lives immersed in music -- studying it, playing it, listening to it -- and we’ve seen its powerful effects first-hand.

That’s why we believe that sonification -- the practice of turning data into sound and music -- can be a powerful tool for getting audiences to engage with a story on new levels, and even reach new audiences entirely. While your editor might not be best pleased if you turn in a concept album in place of your next reported assignment, we believe that more people should be experimenting with sound and music in their data stories.

Sonification in data storytelling is still in its infancy, but in this article, we’ll present some recent examples of journalists, scientists and civil society using data-driven sound and music to amplify their stories, talk about what we’ve learnt while making sonifications over the past year, and present Loud Numbers -- our data sonification podcast.

Sound matters

Sonification is all around us -- we hear it every day, from the “bleeeeep” of a microwave oven finishing its cycle, to the “bloop-blip” of a new message arriving on your phone, to the “crrrnk” of an error message on your laptop. It’s also a staple of TV and film storytelling, with the “beep beep beep” of an operating theatre’s EKG machine, the “pong!” of a sonar in a submarine, or the “bing” of an arriving lift providing an essential sense of place.

What’s interesting about all of these examples is that we have a very clear association with each of them, despite not actively thinking about what they mean. When you hear the sound of a message coming in on someone else’s phone, for example, you instinctively reach for your own. Our brains are wired to respond to certain sounds without thinking.

How can we take advantage of this, in an ethical way, for data storytelling? One approach, in a journalistic environment, might be to use these instinctive associations to convey emotion along with data through sonification.

Data sonification, in its simplest sense, refers to the act of turning data into sound. It includes all the examples above, but it excludes a few things too -- speech isn’t sonification (that would be too broad a definition to be useful), and neither is Morse code or other systems where characters are encoded rather than data.

The pioneers of data sonification

Some of the earliest work in sonification was done by an earthquake researcher named Hugo Benioff more than 70 years ago. Benioff originally wanted to be an astronomer but switched careers when he discovered that astronomers sleep during the day and work at night. He joined Caltech’s Seismological Laboratory in 1924, and in 1932 invented the Benioff seismograph, which records tectonic activity on a roll of paper. Variants of Benioff’s original instrument are used today all over the world.

In his spare time, however, Benioff had another hobby -- making instruments. He developed an electric piano, violin and cello, working with famous musicians to refine his designs. In 1953, he was able to finally combine his passions -- providing a series of audio recordings of earthquakes for one side of an LP record titled “Out of This World”.

But there was a problem. The human hearing range is roughly 20 Hz to 20 kHz, well above the frequency of many earthquake signals. To raise the pitch into the range of human hearing, Benioff recorded the earthquake data onto magnetic tape, then simply sped it up. The resulting tapes allowed people to safely hear and experience the Earth in motion for the first time. A liner note on the album reads: “Skipping [of the needle between grooves] is intentional and indigenous to the nature of the subject matter.”

This approach of time-stretching and pitch-shifting data to bring it into the range of human hearing is called “audification”, and it’s the first of five categories of sonification suggested by German sonification researcher Thomas Hermann in 2002. A more recent example might be “Sounds of the Sun”, which was published in 2018 by NASA using data from its Solar and Heliospheric Observatory, sped up by a factor of 42,000 to bring it into the range of human hearing.
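To make the idea concrete, here is a rough sketch of audification in Python. A synthetic 0.5 Hz tremor stands in for real seismic data, and all the numbers are invented: "playing the tape faster" amounts to writing low-frequency samples out at a much higher sample rate.

```python
# A minimal audification sketch: "speeding up the tape" is the same as
# writing slow samples out at a much higher sample rate. All numbers are
# illustrative, and the tremor is synthetic rather than real data.
import numpy as np
from scipy.io import wavfile

FIELD_RATE = 100   # assumed sample rate of the original recording, in Hz
SPEEDUP = 441      # playback factor; Benioff used tape, NASA used 42,000x

t = np.arange(0, 600, 1 / FIELD_RATE)                      # ten minutes of "data"
tremor = np.sin(2 * np.pi * 0.5 * t) * np.exp(-t / 200)    # decaying rumble
tremor += 0.1 * np.random.randn(t.size)                    # background noise

# Written at 100 * 441 = 44,100 Hz, the 0.5 Hz rumble plays at ~220 Hz,
# comfortably inside the 20 Hz - 20 kHz human hearing range.
pcm = np.int16(tremor / np.max(np.abs(tremor)) * 32767)
wavfile.write("audified_quake.wav", FIELD_RATE * SPEEDUP, pcm)
```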

Hearing the impact of a story

Another relatively simple sonification approach is to play a sound when something happens. These sound “labels” can then represent events in the same way that a “bloop-blip” on your phone represents an incoming message.

In 2014, Tactical Technology Collective used this approach to raise awareness of the problem of the sudden collapse of residential buildings in Egypt due to poor construction regulations. In their sonification, Egypt Building Collapses, data on these collapses is represented as the sound of a building collapsing.

The data shown above includes the number of dead, injured and homeless along with the reasons for these housing collapses and which governorates they occurred in. The data covers July 2012 until June 2013, a year that did not see large natural disasters.

The creators tried several different approaches before settling on this one. “The first was calmer and softer; it could be described as ‘relaxing’ or even ‘meditative’”, the authors told Gabi Sobliye and Leil-Zahra Mortada in 2017. “The second test featured more literal sounds of falling bricks and unstable foundations.” They chose the latter, to connect the data with the strong emotional resonance of hearing a building collapse. These two distinct approaches -- an abstract sound, versus one that sounds like the thing it represents -- are referred to by Hermann as “earcons” and “auditory icons” respectively.

Abstract earcons might be pings, boops and dongs, designed either to sound good or intrusive depending on your storytelling goal. In 2010, The New York Times published “Fractions of a Second”, a piece that uses an “earcon” approach to allow listeners to hear the difference between a gold and a bronze medal in various Winter Olympics events. In 2017, the same newspaper created a similar sonification of the rate of fire of different weapons, in the wake of the Las Vegas and Orlando mass shootings. In both cases, the experience of listening to the data delivers far more than reading the same numbers ever would.

Auditory icons, where choice of sound is connected to the data, can deliver even more emotional weight. The collapsing buildings of Egypt Building Collapses leave the listener in no doubt as to what the data represents. But this can sometimes be too much. If The New York Times had used actual gunfire recordings to sonify the rate of fire of the different weapons, it could have come across as disrespectful or in poor taste. The strong link between sound and emotion makes it important to use care and subtlety when creating sonifications on sensitive issues.

Audio choice

The final two categories of sonification identified by Hermann can be combined under the heading of parameter mapping systems. This is where parameters of a sound -- its pitch, volume, duration, tempo, and so on -- are directly mapped to data. High numbers are represented by higher notes, for example, or perhaps by louder notes. Or both!

There’s a real art to creating effective parameter mappings, and in most cases you won’t know whether something works until you try it. The sheer range of sound properties that you can interact with, as well as their combinations, make it a highly exploratory process. Some mappings are continuous, while some are discrete. Some are obvious, while some are subtle. Many connect emotionally with the listener in interesting ways -- for example, choice of instrumentation, the tempo of a track, or even the musical key are all highly emotionally charged.

The choices that you make around what sound parameters are mapped to what data make up your system. That system can be simple, with just one or two mappings, or it can be very complicated. In either case, it’s vital to be sure that your audience understands what’s going on. That might mean explaining the system beforehand, playing the different sounds that the audience will hear in isolation and explaining what they mean. Or it could be done through interactivity -- allowing your audience to turn different parts of the audio on or off, or even control the volume.
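As a sketch of what a simple system might look like in code -- the dataset and mapping ranges below are invented -- each value here is double-mapped, controlling both the pitch and the loudness of a short tone.

```python
# A minimal parameter-mapping sketch: each data value is double-mapped to
# pitch (higher value -> higher note) and loudness (higher value -> louder).
# The data and the mapping ranges are invented for illustration.
import numpy as np
from scipy.io import wavfile

RATE = 44100
data = [3, 5, 8, 13, 21, 13, 8]   # illustrative values

def tone(freq, amp, dur=0.3):
    t = np.linspace(0, dur, int(RATE * dur), endpoint=False)
    return amp * np.sin(2 * np.pi * freq * t)

lo, hi = min(data), max(data)
notes = []
for v in data:
    x = (v - lo) / (hi - lo)       # normalise to 0..1
    freq = 220 * 2 ** (x * 2)      # pitch: two octaves starting at A3
    amp = 0.2 + 0.6 * x            # loudness: louder equals more
    notes.append(tone(freq, amp))

audio = np.concatenate(notes)
wavfile.write("mapping.wav", RATE, np.int16(audio * 32767))
```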

One example of effective parameter mapping in journalism is Reveal’s sonic memorial to the victims at Orlando’s Pulse nightclub.

In this piece, the lives of people killed in the shooting are represented by different bell tones, which end abruptly in June 2016. This would not make a particularly interesting visualisation, but the choice of bells as instrumentation, as well as the slow tempo, give it a strong emotional weight. “They are meant to be funereal but also celebratory,” writes creator Jim Briggs. “We find music in the moments when tones, like lives, intersect.”

Listening to the data

Over the past year of working on sonifications, we’ve found some approaches to be more effective than others. For starters, we’ve learned a lot more about which kinds of audio mappings work best with which kinds of data.

Musical pitch is often used in data sonification to communicate quantity, where notes of a higher pitch signify more of something (people, money, cars, onions). Pitch is the most common mapping used in sonification -- it’s the default in free sonification app TwoTone, for example -- and it’s a very useful one. Our ears are generally good at detecting small pitch shifts, which is why out-of-tune singing sounds so awful. Tiny pitch differences can effectively communicate tiny differences in underlying data values.

Pitch can slide, making it great for data that varies continuously, like height across a population or CO2 levels in the atmosphere. However, pitch slides can sound a little menacing, which is appropriate for a climate change sonification but perhaps not in all storytelling situations. An alternate approach is to use pitch in the discrete steps of a scale, which is best used to communicate data that forms ordered categories, like income brackets or age groups. In practice, most sonifications that use pitch tend to use musical scales to communicate continuous data, probably because they sound nice, as in the Financial Times’ sonification of the yield curve from March 2019.
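One way to get those discrete steps is to quantise a pitch mapping to a musical scale. The sketch below (the scale, note range, and data are all assumptions for illustration) snaps each value to the nearest note of a major scale, expressed as MIDI note numbers.

```python
# Sketch: quantising a continuous pitch mapping to a major scale, so data
# lands on discrete, musical steps. Scale, range, and data are assumptions.
MAJOR = [0, 2, 4, 5, 7, 9, 11]     # semitone offsets within one octave

def snap_to_scale(value, vmin, vmax, base_midi=57, octaves=2):
    # All allowed MIDI notes across the chosen range (57 = A3).
    notes = [base_midi + 12 * o + s
             for o in range(octaves + 1) for s in MAJOR]
    x = (value - vmin) / (vmax - vmin)        # normalise to 0..1
    target = base_midi + x * 12 * octaves     # continuous MIDI pitch
    return min(notes, key=lambda n: abs(n - target))

data = [1.2, 1.9, 2.4, 3.1, 2.8]
print([snap_to_scale(v, min(data), max(data)) for v in data])
```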

But pitch can be ambiguous. Mapping higher pitches to larger quantities makes sense on one level: in data visualisation, higher typically means more, as in a bar chart or line chart, and in music notation notes of higher pitch are written nearer the top of the stave. Interestingly, however, this relationship does not correspond to the physical origins of pitch. In the real world, it’s smaller drums and shorter strings that make higher sounds, and larger drums and longer strings that make lower sounds. So a higher pitch could signify more of something, or it could just as easily signify less.

Fortunately, there are lots of other options. You can map quantity to relative loudness (so louder equals more), number of instruments (more instruments playing equals more data), EQ (bassier equals more), duration (longer sound equals more), number of sounds in a time period (more sounds equals more), tempo (faster equals more) and so on. Or you can double or triple-map data to more than one of these.

All of these audio parameters have unique qualities and also distinct limitations. Loudness can have a powerful emotional effect on listeners but lacks precision: it isn’t brilliant for communicating fine differences. It is also very affected by the sound system used for playback (especially the volume level and the bass response of your audience’s speakers) and by audio engineering techniques like compression. So we believe it is best used when double-mapped with another audio parameter like pitch or instrument, or coupled with visuals.

The BBC visualisation looks at the scale of the COVID-19 death toll. The length of each flower's stalk corresponds to the number of confirmed coronavirus cases in the country.

In this BBC visualisation, for example, the loudness of the music corresponds to the number of daily COVID deaths globally, but the visuals do most of the detailed data communication work, with the sonification serving as an impressionistic enhancement to them rather than something that you’d listen to for information on its own.

Instruments with different sounds lend themselves naturally to communicating data categories. In Listen to Wikipedia, a real-time sonification of Wikipedia edits, bells indicate additions and string plucks represent subtractions. Pitch changes according to the size of the edit: the larger the edit, the deeper the note. And a string swell means a new user just joined Wikipedia.

This sonification by Jamie Perera uses the number of sounds in a time period to communicate UK COVID deaths between March and June last year. Each sound represents one death, and 30 seconds of sound represents one day of data — the entire piece lasts a harrowing 55 minutes. This scale humanises the numbers, the sonic equivalent of using one dot to represent one person in a data visualisation.

Finally, one added benefit of sonification, when compared to traditional data storytelling, is that it’s more accessible to the blind and partially sighted. Over the last decade, the technology to create snazzy interactive data visualisations has consistently outpaced the technology to make those visualisations accessible to everyone.

In this regard, sonification is a tremendously useful technique. Adding a sonification component to your data visualisation can not only boost accessibility for people who are blind or partially sighted -- here's a great example of a sonification targeted primarily at that audience -- but also improve the experience for fully-sighted users who find the audio feedback helpful. As with other common accessibility initiatives, more people benefit than is often expected.

Sounding it out: data storytelling

Over the past year, we’ve been working on a collection of data stories told through the medium of sound and music, using many of the techniques detailed above. It’s called Loud Numbers, and it’ll be released later this year as both a podcast and an extended play of music available on all good streaming services.

Our goal is to create something that not only tells a series of compelling data stories but is also a pleasure to listen to. We wanted to know if we could hit a sweet spot where we communicate the story and also make something that sounds beautiful, something you'd press play on more than once.

We’ve found that there’s a lot of creative space to explore in the triple point between stories, data and music. Each of our sonifications nods to a different genre of music while telling a related data story. For example, we’ve got old-skool jungle music soundtracking a story about inequality in the United States, the history of EU legislation presented as baroque counterpoint, and the media frenzy around Brexit sonified using the “news music” so beloved of TV news journalism.

Ultimately, we hope that by developing Loud Numbers we can push at the boundaries of what’s possible in data journalism and data storytelling. We believe that sound and music have the power to not only reach new audiences, but better serve existing audiences by deepening their emotional connection to the story. Next time you’re working on a data journalism project that would benefit from a little extra emotional connection, why not give it a try?

Duncan Geere is an information designer, writer, editor and data journalist with more than 10 years of experience. He covers the intersection of science, technology and culture, with a particular interest in environmental issues. He recently completed a stint as a senior editor & creative producer at Information is Beautiful, and has done freelance work for companies like SAP, Storythings, FutureEarth, Drawdown, and publications like Wired, the Guardian, BBC Wildlife, BBC Science Focus, Technology Review, Popular Science, and Techradar.

Miriam Quick is a data journalist and researcher specialising in information visualisation. She finds facts and data and builds them into compelling stories. Then she works with designers to produce information graphics, data visualisations, data artworks and installations. Her first book, I am a book. I am a portal to the universe., written with Stefanie Posavec, was published by Particular Books (Penguin UK) in 2020. She holds a PhD in Musicology from King's College London. Her thesis focused on performance style in recordings of the works of Anton Webern. This research involved analysing data from sound recordings and using software to visualise the results, which sparked her interest in data visualisation.

The promise of Wikidata https://datajournalism.com/read/longreads/the-promise-of-wikidata Wed, 10 Feb 2021 09:05:00 +0100 Monika Sengul-Jones https://datajournalism.com/read/longreads/the-promise-of-wikidata A decade ago, let’s say you wanted to know the population of the metropolitan area of Accra, Ghana using the open web. For a quick answer, you might look at a Wikipedia article’s infobox on Accra.

There would be a number. Let’s say you used another language version, French Wikipedia. You might get a second number. With a search engine, a third number.

Each might be correct, contextually. Of course, people are born, die, move, and boundaries are being negotiated. Any population statistic is bound to be out-of-date from the moment it’s collected.

But the variation—and lag—in updates on data points like population across Wikipedias, not to mention elsewhere on the web, has frustrated open access semantic web advocates, working for more machine-readable linked data online. Because the inconsistency isn't an issue of controversial records or out-of-date datasets. Rather, it's a problem of unlinked data.

Enter Wikidata, in 2012. A machine and human-readable linked knowledge base that straddles the best of both. Humans can edit. Machines can read.

Update the population of a city in Wikidata; insert the linked identifier into article pages—and bada-bing, bada-boom—when the linked database is updated, all the identifiers running the information also update. The population of Accra, Ghana can be consistent no matter where you look.

Wikidata is a sister project to the better-known crowdsourced encyclopedia, Wikipedia—which has benefits for data journalists.

And both are part of the Wikimedia movement, whose mission is to bring "free educational content to the world."

But unlike Wikipedia, which at 20 years old is recognised for being surprisingly reliable despite predictions that the end of Wikipedia is near, Wikidata is best known for having a promising future—that hasn't quite arrived. Illustrated, for example, by the fact that Accra's population is still unlinked on Wikipedia (the English Wikipedia article's infobox references the census, not Wikidata).

How can data journalists use the Wikimedia movement's linked knowledge base data?

The first step is discerning the difference between the promise of the project and how it works today. Elisabeth Giesemann, from Wikimedia Deutschland, recently gave a talk for journalists and explained that Wikidata is actualising the vision of a semantic web touted by Sir Tim Berners-Lee, creator of the world wide web.

Berners-Lee similarly champions Wikidata. He co-founded the Open Data Institute, which recently honoured Wikidata with a special award.

Though there’s evidence that Wikidata is already ushering in a new era of linked data—with the dataset being incorporated into commercial technologies such as Amazon's Alexa and Google's knowledge graph—there are limitations, including whether or not web pages link to Wikidata.

The project suffers from the biases and vandalism that plague other Wikimedia projects. Including gender gaps in the contributor base—the majority of the volunteer editors are male. And the majority of the data is from—and about—the Northern hemisphere. The project is young, Giesemann emphasises.

As a concept, one might compare Wikidata to a busy train station. There are millions of links between data points and interlinks to other open datasets.

Denny Vrandečić, who designed Wikidata and worked for Google for six years as an ontologist before joining the Wikimedia Foundation last summer, said Wikidata connects to 40,000 other databases, including Wikipedia, DBpedia, the Library of Congress, the German National Library, and VIAF.

“[It’s] a backbone of the web of data,” said Kat Thornton, a researcher at Yale University Library with expertise in linked data. “If you are interested in data, it is better than search. It would be oversimplifying Wikidata to call it search. [There] you are matching the string, Wikidata’s web of knowledge is far more powerful than string matching.”

A cropped visualisation of Wikidata's position in the linked open data, which according to Kat Thornton, a researcher at Yale University Library, is the backbone of the web.

How Wikidata is designed to work

Like other linked data projects, Wikidata models information using the Resource Description Framework (RDF). This model expresses data in semantic triples. Subject --> predicate --> object. For example, Accra is an instance of a city.

An item can be a definite object—a specific book, person, event, or place. Or a concept—transgender, optimism, or promise. Items are identified with a “Q” and a unique number.

Accra is Q3761. City is Q515. Predicates are labelled with a “P” and a number. Instance of is P31. Relationships between items are stored as statements. Accra is an instance of a city. Q3761-->P31-->Q515.
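You can inspect these statements directly through the public API. Here is a minimal sketch (using the wbgetentities endpoint, with error handling omitted) that checks whether the triple above -- Accra (Q3761), instance of (P31), city (Q515) -- is really asserted in the knowledge base.

```python
# Minimal sketch: confirm the statement Q3761 (Accra) -> P31 (instance of)
# -> Q515 (city) via Wikidata's public wbgetentities API.
import requests

resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbgetentities", "ids": "Q3761",
            "props": "claims", "format": "json"},
)
claims = resp.json()["entities"]["Q3761"]["claims"]
p31_values = [c["mainsnak"]["datavalue"]["value"]["id"]
              for c in claims.get("P31", [])
              if c["mainsnak"]["snaktype"] == "value"]
print("Q515" in p31_values)   # True if "instance of: city" is asserted
```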

“Computers are often very fast, but generally very dumb, so in a system of explicit representation of information, you have to tell them everything. Tokyo is a city. The sky is up. Water is wet,” wrote Trey Jones, a software engineer for the Wikimedia Foundation, in a recent article on linked data.

Diagram visualising linked data in triples, using the item for Accra, Ghana as an example.

However, many data are factually contentious. When is someone dead? Most births and deaths are cut and dried—unless you are Terri Schiavo.

How about Taiwan? Is it an instance of a sovereign country, or the territory of another country? The same questions arise for Sudan, Palestine, Crimea, and Northern Ireland. The list goes on (and it does).

Helpfully, Wikidata allows for ambiguity. Taiwan is an instance of a country. Taiwan is an instance of a territory. Both statements can exist in the item.

There’s also room for references, which enable the reuse of linked data by improving error detection. But contributors can make statements without them.

This lowers barriers to entry for data donations and new contributions, but can mean that "inaccurate data, or messy database imports, such as peerage or vandalism, are a challenge," said Jim Hayes, a volunteer Wikidata contributor and Wikimedia D.C. member in an email interview.

As a result, there are partialities in the available linked data. Some organisations have already donated data.

There are 25,000 items for Flemish paintings in Wikidata, thanks to a 2015 data donation from a collective of Flemish museums and galleries. But other topics—such as the cultural heritage of nations in the Global South—are left wanting.

This is a topic Mohammed Sadat Abdulai, a co-lead of the non-profit organisation Art+Feminism and community communication manager with Wikimedia Deutschland, deals with daily.

He said in a recent phone call that Wikidata’s eurocentrism can materialise not only through the presence or absence of data, but also in the subtle way that data is modelled.

“If you come from a different way of thinking, you will find it is difficult to model your way of thinking with Wikidata,” he said. He gave the example of name etymologies. “There are Ghanaian names in Dagbani that are meaningful through their connection to days of the week,” said Abdulai. “Atani means a female born on a Monday. But this meaning is not easy to model using subclasses in Wikidata.”

Abdulai strives to expand representation of Ghana with Wikidata, but his experience suggests there can be linguistic ghettos. “It is a good thing Wikidata is flexible,” he said. “You can find your own way of modelling. But since it is not conventional, you end up working in your own little space.”

Quick links

An introduction to Wikidata

Donate data to Wikidata

Probe differences in titles and coverage to understand possible regional differences and sources on the topic. This screenshare video uses the article on the 2020 women’s strike against Polish abortion law as an example. It demonstrates how to find the Wikidata link and access multiple language versions of a Wikipedia article.

Three ways data journalists can bring Wikidata into their data storytelling

1. As a shortcut between Wikipedia versions

One easy way to use Wikidata is as a node of connection between Wikipedias: the Wikidata item links the articles on a topic across the different language versions. This shortcut can help you check for variation in existing coverage, and pry for new angles.

For instance, the 2020 women's strike in Poland against the abortion law has articles in 17 Wikipedias in different languages, each a slightly different version providing coverage of the strike. To quickly dig into details of variation on Wikipedias, and the narratives they index, use Wikidata to toggle between article versions.
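The same toggling can be done programmatically, through an item's sitelinks. The sketch below is one possible approach; the item ID is a placeholder, so look up the real one from the article's "Wikidata item" link first.

```python
# Sketch: list every language edition with an article on a topic, via the
# sitelinks on its Wikidata item. The item ID below is hypothetical --
# substitute the real one from the article's "Wikidata item" link.
import requests

ITEM = "Q100000000"   # placeholder item ID
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbgetentities", "ids": ITEM,
            "props": "sitelinks", "format": "json"},
)
sitelinks = resp.json()["entities"][ITEM].get("sitelinks", {})
for wiki, link in sorted(sitelinks.items()):
    print(wiki, "->", link["title"])
```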

2. Use Wikidata at scale. The API is available and the interface is multilingual.

This requires caution, however. Barrett Golding can attest. When the pandemic hit last year, Golding—a former NPR producer and freelance data journalist—launched Iffy.news. Designed for researchers and journalists, the site contains indexes and lists on sources of mis/disinformation.

Golding uses large-scale data harvesting to showcase whether or not a website has a reputation for fact-checking, based on credibility rankings from databases such as Media Bias/Fact Check. “If we’re going to call other sites untrustworthy, we can’t just say “trust us” as the reason why. So each Iffy site links to the failed fact-checks that make that site unreliable,” Golding explained.

More recently, he began accessing information from Wikidata and Wikipedia to cross-check the reliability of websites and online sources, thanks to a grant from WikiCred. (For disclosure, I am also working on a project on reliable sources and Wikipedia funded by WikiCred).

That’s where things fell apart. The data were too piecemeal.

Infowars, a well-known instance of “fake news,” is described as such in its English Wikipedia article. Wikipedia editors have also blacklisted the website from being used as a reliable source in citations, according to the hand-updated list of Perennial sources.

But these classifications didn’t make it to Wikidata. Infowars was just an instance of news satire and a website. (That is, until two weeks ago, when the item was edited to add “fake news” as an instance.)

The takeaway for data journalists? Be aware that large-scale data harvesting from Wikidata’s API can strip out nuance at scale, rather than the other way around.

Quick links

How to contribute to Wikidata

Wikidata API for Python

Iffy.news

Wikipedia Perennial Sources

Leverage Wikidata's strengths

There are Wikidata storytelling success stories, which often include using the dataset in conjunction with other data. Laura Jones (no relation to Trey or the author), a researcher with the Global Institute for Women’s Leadership at King's College London, authored a report that shows how women—journalists and experts—have been involved in coronavirus media coverage.

To find out, the study used Wikidata and Wikipedia’s API to identify the gender and occupation of 54,636 unique people who had been mentioned in a vat of news content sourced from Event Registry's API, an AI-driven media intelligence platform, during the 2020 pandemic.

Thanks to the information stored in Wikidata, Jones was able to identify most of the unique people mentioned in the news coverage – experts and journalists. The majority of media coverage about the pandemic was written by male journalists, while only one in five expert voices interviewed about the pandemic was female, Jones concluded.

Science Stories.io uses Wikidata and other linked data projects to visualise stories about women in science and academia. By aggregating images, structured data, and prose at scale, Science Stories.io generates hundreds of multimedia biographical portraits of historical and contemporary notable women.

Scholia, meanwhile, pulls data from Wikidata to create visual profiles of items including chemicals, species, and people.

Quick links

Covid Media Analysis Report

Science Stories.io

Scholia

3. Discover relationships through Wikidata's query service

Get a sense of what’s in Wikidata, and how this may aid your data storytelling, through querying. The Wikidata query service is free and available online. You’ll need to use SPARQL, a query language for linked data whose syntax resembles SQL.

Whether you are already familiar with SPARQL or just getting started, there is an abundance of tutorials and training videos to learn from. The query service also has examples. Run an example yourself for fun by pressing the blue “play” button. There are also volunteer users who are willing to run queries for you.

Wikidata query service includes example queries.

You can modify examples or write a query from scratch. Query results can be visualised, shared, downloaded, or embedded. It's worth running a query before you use the API or download a data dump.

When it comes to an effort like Golding’s project on “fake news” websites, the query could be the first red flag that the data just isn’t there. For instance, a query for instances of “fake news” websites in Wikidata reveals fewer than a dozen.

Try it—and keep in mind the results from your query will be as of the date you run the query, not as of mine, nor as of Golding's. (Right now, there’s no way to share a hyperlink to a historical version of a query.) Part of the problem, as Golding found, is idiosyncratic classifications. Some items are instances of websites, others are online newspapers.

Another query example is a “Timeline of Death by Burning.” (Try it). I modified the query by substituting the cause of death (P509) from burning (Q468455) to decapitation (Q204933). (Try it). Both rendered grisly timelines showcasing a long history of these particular forms of death.
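The same modified query can also be run outside the browser, against the public SPARQL endpoint. The sketch below is one way to do it; the LIMIT, the user agent string, and the use of P570 (date of death) are my additions, not part of the original example.

```python
# Sketch: running the modified cause-of-death query from the text against
# the public endpoint. P509 = cause of death, Q204933 = decapitation (both
# given above); P570 (date of death) and the LIMIT are my assumptions.
import requests

QUERY = """
SELECT ?person ?personLabel ?died WHERE {
  ?person wdt:P509 wd:Q204933 .
  OPTIONAL { ?person wdt:P570 ?died . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 20
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "wikidata-demo/0.1 (example script)"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["personLabel"]["value"],
          row.get("died", {}).get("value", "date unknown"))
```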

My next modification reveals a limitation of the dataset. I wanted to create a timeline of women who are murdered in gender-based violence. This has a name, femicide (Q1342425). Again, grisly, I know—but also important. I expected the number might be lower than reality. But I was not prepared for a timeline of one. One femicide. (Try it).

Liam Wyatt, who manages the Wikicite programme for the Wikimedia Foundation, said this is a typical pitfall. “You have to caveat any query result with 'as far as Wikidata knows,'” he explained in a phone interview.

For my query, it’s possible there are other femicides documented in Wikidata, categorised as instances of homicide or domestic violence. But the invisibility remains. For instance, the murder of Pınar Gültekin by her ex-boyfriend last year made headlines around the world. Women took to the streets in Turkey to protest.

And there was a much-debated social media hashtag campaign, #challengeaccepted, to raise awareness about femicide.

While there is a Wikipedia article in English about the murder, and a Wikidata item for the event, Gültekin—not to mention the manner of her death—is not included as a human in Wikidata.

Quick links

Wikidata Query Service Tutorial

Gentle Introduction to the Wikidata Query Service

Request a Query

"Wikidata is now powerful and important, but still esoteric and incomplete,” said Wyatt, on the ambiguities of the current state of Wikidata. “It’s a bit of a wild west. Journalists who can get in on the ground floor, on this wave while it’s still picking up speed, they will really be in a position to ride the momentum.”

A promising project, when we remember that promise, according to Wikidata, is also known as liability. That is, to be “held morally or legally responsible for action or inaction.” Use it freely and be mindful not to substitute this dataset for news judgement.

As Last Moya writes in “Data Journalism in the Global South”, data can aid journalism in speaking truth to power provided “journalistic agency and not data is King.”

Monika Sengul-Jones, PhD, is a freelance researcher, writer and expert on digital cultures and media industries. She was the OCLC Wikipedian-in-Residence in 2018-19. In 2020, she is co-leading Reading Together: Reliable Sources and Multilingual Communities, an Art+Feminism project on reliable sources and marginalised communities funded by WikiCred. @monikajones, www.monikasjones.com

Thanks to Molly Brind'Amour (University of Virginia), Will Kent (Wiki Education Foundation), Lane Rasberry (University of Virginia), and Houcemeddine Turki (Wikimedia Tunisia) for speaking with me for background research for this story.

Privacy Day 2021: what journalists need to know https://datajournalism.com/read/longreads/privacy-day-security-guide Thu, 28 Jan 2021 00:00:00 +0100 Andrea Abellán https://datajournalism.com/read/longreads/privacy-day-security-guide Cybersecurity and digital privacy are major concerns for today’s journalists. As more journalists work remotely, the amount of time we spend online continues to grow. With online threats becoming more prevalent and sophisticated, we must understand how our data might be compromised and what to do to protect it. This is especially the case for investigative journalists who face more significant digital security threats given the sensitive information they handle. But the digital footprints we leave behind don’t just impact us professionally. Not adhering to digital hygiene best practices can also compromise us and our contacts personally. The good news is a tremendous amount of digital security resources, tools and information exist online to help safeguard you and your data.

Are you doing enough to protect your data?

To help you brush up on your digital security knowledge, we caught up with eight cybersecurity and privacy professionals to crowdsource their best tips for managing your data more securely. Read the edited Q&A below with our panel. Topics include:

  1. The biggest threat to internet privacy for journalists.
  2. Whether journalists can genuinely investigate anonymously online.
  3. Which aspects of privacy media professionals should safeguard.
  4. Keeping your personal information secure when using cloud storage and sharing.
  5. Must-have privacy tools for journalists.
  6. Advice for media professionals on protecting their sources and information.
  7. How to balance the promotion of your work with your online privacy and safety.

1. What is the biggest threat for journalists when it comes to internet privacy?

Viktor Vecsei (IVPN): The biggest threats journalists face vary – it depends on which country they live in, issues they focus on and the type of adversaries they might face. Each person’s situation is unique. At least basic levels of privacy protection measures must be in place to avoid personal threats from readers disagreeing with their mission, harassment by government officials or getting targeted with disinformation campaigns. Two distinct areas are important to consider: protecting their identity when doing investigative work or research and protecting their personal privacy when publishing materials and disseminating them on social media. Each requires different tools and techniques and they need to consider how, what and when they access and share to minimise threats.

Sasha Ockenden (Tactical Tech): We are all immersed in technology and data – and the pandemic has only exacerbated this. The major privacy issue for all of us, including journalists, is how to compartmentalise our private and professional activities, and make sure that the tools we use for one do not affect the other. For journalists in particular, given the potential consequences of sensitive information being exposed, it is more important than ever to understand how data is collected, stored and (ab)used. The fast-changing nature of online tools and platforms, and the ways they are regulated in and across different jurisdictions, can make it hard to keep on top of. This is especially the case for those who consider themselves less tech-savvy. Tactical Tech’s Data Detox Kit provides clear suggestions and concrete steps to keep control of all aspects of your online life, make more informed choices and change your digital habits in ways that suit your private and professional lives.

Henk Van Ess (Journalist): The biggest threat is the journalist themselves. Those who say "I have nothing to hide" will inevitably be trolled, embarrassed, cloned, or worse: hacked.

Chris Dufour (Digital security consultant): There is no single "big threat" in terms of a specific piece of malware, hacking technique, or attacker. That's the threat: the internet is iteratively changing and evolving daily, sometimes hourly. As such, it can be virtually impossible to fully secure oneself, and even if you could, there are corollary vulnerabilities in the form of those around you and the information they share about you: your family, friends, coworkers. I believe the biggest threat is the individual's degree of skill and time spent securing themselves against the attack and undue influence.

Valentin Franck (Tutanota): There are several threats to privacy in today’s internet. Journalists are affected by those in particular because they are more likely to hold sensitive information than the regular internet user. First of all, large parts of the internet are tracked by private companies with the primary objective of user profiling in order to sell targeted advertisements. The amount of information gathered by those companies is enormous. The exposed position of journalists and the fact that they can be multipliers means that it is interesting to learn about and shape their thinking and interests for a wide range of actors. Also, state actors might force private companies to help gather information on a person of interest.

2. Can journalists genuinely do their work anonymously online?

Henk Van Ess (Journalist): Not completely. But with Surveillance Self-Defense, a resource developed by the Electronic Frontier Foundation, you'll be able to attempt it.

Viktor Vecsei (IVPN): Complete anonymity is hard to attain online – no single tool or technique can give you that 100% protection. To achieve protection and gain peace of mind, journalists need to accept this premise and target the best level of anonymity in every situation. A combination of tools that require no personally identifiable information to get started – such as secure and encrypted messaging, Tor or a trusted VPN service, no-logs email provider – can give them a reasonable edge against detection by unwanted eyes and ears. Journalists need to keep their threat model in mind (what’s the worst that can happen? what capabilities do my adversaries have?) when deciding on the toolkit they use to mask their identity when working with sensitive information. In straightforward cases, a simple checklist of basic errors one should avoid to get tracked down could be enough. In situations where their lives could be at stake, they need to invest time and resources into proper security preparations to protect their anonymity to the highest extent possible.

Laura Tich (SheHacks_KE): There are no tools or services that can guarantee total anonymity. Total anonymity would mean not just hiding your online persona but also your device and the services you’re accessing. To achieve even a bit of anonymity, you would need to put a lot of measures in place. For example: using an avatar instead of your real identity, using a VPN to encrypt your connection and hide your IP address, changing your MAC address to mask your device, etc. By putting these measures in place, you can achieve some level of anonymity and make it difficult to track your identity. It is however important to note that services such as VPNs are not impenetrable and they can still be compromised. Always do your research before using some of the privacy and security tools available.

Sasha Ockenden (Tactical Tech): Total anonymity is difficult, if not impossible, to achieve: we see it above all as an ongoing process which can be achieved to a certain level for a certain length of time. Journalists for whom anonymity is particularly important can use tools like the Tor browser to mask their identity online. Put simply, Tor separates the information that identifies your computer from the web pages that you are accessing. You can find a more detailed guide to using Tor on different operating systems in Security in a Box here.

Valentin Franck (Tutanota): As far as we know, tools like Tor provide good anonymity that give even powerful agencies like the National Security Agency a hard nut to crack. While there are known technical deanonymisation attacks against Tor, those are hard to carry out in the real world.

3. What aspect of one's privacy should media professionals safeguard when working online?

Laura Tich (SheHacks_KE): Your Personally Identifiable Information (PII) is the most critical piece of data that needs to be safeguarded at all costs. This is mostly because PII is who you are and, once this data is out there, it leaves you open to various attacks, from identity theft to access to your personal accounts. It can also lead to attacks against the people close to you. As a journalist, you need to prioritise your needs by considering the following: the areas of your work that create additional risks, sensitive information which your adversaries may find useful and the impact certain attacks would have on you or your organisation. With this in mind, you will have an idea of what aspect of your online security you should prioritise.

Henk Van Ess (Journalist): Use the web as if your screen were visible 24 hours a day on a giant display in the middle of your town. Would you still say: I have nothing to hide? Know at least your shadow. Minimise harm: this SEC article from 2005 is still relevant after 16 years, as are these 16 tips.

Sasha Ockenden (Tactical Tech): There are numerous elements which media professionals should safeguard when working online, including their contacts, location data and digital habits. This applies to all devices and accounts they use, as well as the web pages they visit and the platforms they use to communicate. Equally, it is easy to overlook the importance of the digital practices of those you are working with and sources (who may often be even more vulnerable): if they are not safeguarding their data properly, your own privacy and security will be at risk, no matter what you do.

Nicola Nye (FastMail): When you're working, you want to have your information at your fingertips in an organised fashion so you can find what you need easily. You also want to make sure that the emails you're exchanging are kept private and won't be used by third parties to sell you things you don't need. It's worth finding an email provider where usability isn't sacrificed for privacy. Keep your account safe from hackers by using two-factor authentication, build strong passwords you don't need to remember by using a password manager, and register your email address at haveibeenpwned.com to be notified if your account credentials have been leaked on another site.
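
To illustrate how a breach check can work without exposing the secret itself, here is a minimal Python sketch against the free Pwned Passwords range API, a companion service to haveibeenpwned.com. Only the first five characters of the password’s SHA-1 hash ever leave your machine (so-called k-anonymity); the matching happens locally. Treat this as an illustrative sketch, not vetted security tooling.

    import hashlib
    import urllib.request

    def pwned_count(password):
        # Hash the password; only the first five hex characters of the
        # digest are sent to the API, never the password itself.
        digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
        prefix, suffix = digest[:5], digest[5:]
        url = "https://api.pwnedpasswords.com/range/" + prefix
        with urllib.request.urlopen(url) as resp:
            body = resp.read().decode("utf-8")
        # The response lists hash suffixes with breach counts; match locally.
        for line in body.splitlines():
            candidate, _, count = line.partition(":")
            if candidate == suffix:
                return int(count)
        return 0

    print(pwned_count("password123"))  # prints a large breach count

A non-zero count means the password has appeared in known breaches and should not be reused.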

Journalists are placing their safety in the hands of those running privacy protection services. Before use, always verify if the developers are trustworthy.

4. What are some suggested tactics for keeping your personal information secure when using online services such as cloud storage and file-sharing systems?

Laura Tich (SheHacks_KE): One of the ways you can keep your data safe is by ensuring that access is restricted to only you or, if it’s a shared drive, only to authorised people. Strong, complex passwords are important, and multi-factor authentication will add an extra layer of security. This matters not just for cloud storage but for your online accounts and devices as well. If, for example, a hacker cracks your password, they would still need a code or a YubiKey in order to access your files. Another way of protecting your cloud data is by checking your connected accounts and apps. In many scenarios, attackers will not try to access your cloud storage directly but will leverage apps or accounts that are connected to your cloud account. If you are using a Google account, you can check your linked accounts in your account settings. Also, make sure your device is protected: store your physical devices safely and put measures in place to prevent unauthorised access. In case your device is stolen, go to your cloud settings and deactivate that device.

Henk Van Ess (Journalist): The best way to protect your personal information and keep it secure is to not use those services in the first place. Law enforcement and hackers can try to unlock your data in the cloud. Build your own personal cloud server, or get a pod.

Chris Dufour (Digital security consultant): The best method is don't use your personal information ANYWHERE. In almost every setting, there is no reason to share personal details about yourself anywhere online. Use false names, burner phone numbers, non-attributable email addresses, and VPNs whenever possible and always browse with a secure, privacy-oriented browser. If you're a journalist, your organisation should be investing in non-attributable tools and practices to protect your information. Need a secure Dropbox folder to share data? Great! Use your organisation's name and a non-attributable number for each user, not your own name. If your org is not investing in these things and does not have a security manager skilled enough to help you figure it out, petition your supervisors to hire a reputable digital security consultant to do it for you and train you to do it well in perpetuity.

Valentin Franck (Tutanota): It is advisable to get some information about a service, especially on what it does to protect users’ privacy, because there are huge differences between services. This information is usually accessible in the privacy statement, and if it is a secure service there will usually also be some explanation of how cryptography is used to enforce user privacy. It is recommended that journalists only use online services that provide end-to-end encryption. For instance, an end-to-end encrypted cloud service will only see that the user uploaded some data and who else has access to it. However, the only ones able to read or modify the contents of a file are those explicitly authorised. An alternative to using existing end-to-end encrypted online services is to host your own service. Of course, this requires some technical skill to be done securely, but there are a number of privacy-friendly self-hosted solutions. Nextcloud, for example, is a great cloud collaboration platform that can be used not only for file storage and sharing but also to create polls and organise teams.

Journalists need an end-to-end encrypted email service to communicate with their contacts.

5. What are your must-have privacy tools for journalists?

Viktor Vecsei (IVPN): We recommend starting with the following checklist:

  • A secure and anonymous file-sharing tool through which you can receive sensitive materials without compromising the identity of your sources (e.g. OnionShare or SecureDrop)
  • Tor or a VPN to hide your IP address and encrypt your connection
  • A secure, encrypted email provider that offers the option of turning logs off (e.g. Tutanota or ProtonMail)
  • An encrypted messaging app that keeps and shares no data (and metadata) on your conversations (e.g. Signal)
  • A password manager that helps with generating and managing secure, distinct passwords (e.g. KeePass or Bitwarden)

Journalists are placing their safety in the hands of those running privacy protection services. Before using any of them, one should always verify if the developers are trustworthy and follow information security best practices. The best approach is soliciting recommendations from knowledgeable sources they trust.

Henk Van Ess (Journalist): It is wise to use a VPN that supports less mainstream protocols such as WireGuard (offered by providers like AzireVPN). Enable DNS encryption and use 1.1.1.1.
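
To make encrypted DNS concrete: the resolver at 1.1.1.1 also answers over HTTPS, so a lookup travels inside an encrypted connection instead of plain-text DNS that anyone on the network path can read. Below is a minimal Python sketch against Cloudflare’s public DNS-over-HTTPS JSON endpoint; the queried hostname is just an example.

    import json
    import urllib.request

    # Ask Cloudflare's DNS-over-HTTPS endpoint for the A records of a
    # hostname; both query and answer are wrapped in ordinary HTTPS.
    req = urllib.request.Request(
        "https://cloudflare-dns.com/dns-query?name=datajournalism.com&type=A",
        headers={"accept": "application/dns-json"},
    )
    with urllib.request.urlopen(req) as resp:
        answer = json.load(resp)

    for record in answer.get("Answer", []):
        print(record["name"], record["data"])

In practice you would enable DNS-over-HTTPS in the browser or operating system rather than resolve names by hand; the sketch only shows what the encrypted transaction looks like.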

Chris Dufour (Digital security consultant): I always recommend using the following:

  • A reliable VPN that has been well-reviewed by a third-party security researcher
  • A hardened internet browser that allows you to turn off cookies and scripts when desired
  • Multiple devices for different purposes or identities (e.g. a phone for work and a different phone for home life)
  • A secure instant messaging app like Wire or Signal
  • A service like Abine’s Blur that allows you to anonymise as much of your digital identity as possible (e.g. masked credit card details)

Sasha Ockenden (Tactical Tech): In addition to the tools outlined so far, we recommend using a secure, privacy-conscious browser such as Tor Browser, Firefox, Chromium or Brave, with the following add-ons: HTTPS Everywhere (which makes websites use a more secure connection) and uBlock Origin (which filters content). For instant messaging, such as with sources, we recommend Signal; for sending emails Thunderbird with Open PGP; and to keep track of passwords (e.g. for contact databases) a password manager such as KeePassXC. Given the challenges of working remotely and increasingly online, we have published an article called "Technology is Stupid" with recommended criteria on how to assess digital tools, and why some may be more appropriate to use than others depending on the context, including a comprehensive list of tools.

Valentin Franck (Tutanota): Most importantly, journalists need an end-to-end encrypted email service to communicate with their contacts. A messenger app with a focus on privacy and security is Signal, which also allows you to make end-to-end encrypted video calls. The Tor browser allows anonymous investigation on the web; at the same time, Tor can help circumvent censorship in some countries. Another must-have is a password manager that enables you to use secure random passwords for all of your accounts, while you only have to memorise a single strong password to access the password manager.

If you want to go one step further, you should make sure your devices and operating systems are secure. Use system encryption and lock your devices with secure passwords, or even use a specific operating system like Tails, whose goal is to protect your identity and data online and physically. For further recommendations see Security in a Box.

Be judicious about promising confidentiality. Keep secrets secret.

6. What advice would you give to media professionals to protect their sources of information? Are they responsible for guiding their sources on how to stay safe online?

Naiara Bellio (Maldita Tecnología): When communicating with confidential sources it is best to use multiple devices so you aren’t associated with a specific device. For example, there are investigative journalists who travel with more than one cell phone and at least two computers. One is their personal device, which probably carries a heavier digital footprint, and the other can be an encrypted device or one that runs an operating system like Linux. At the very least, it shouldn’t have personal accounts linked for email, messaging services or social networks. Investigators and journalists also use these kinds of devices when testing how the GAFAM companies (Google, Apple, Facebook, Amazon, Microsoft) behave when devices are connected and how their algorithms are used.

Sasha Ockenden (Tactical Tech): Media professionals absolutely have a responsibility to keep their sources safe, as the source is often the person most at risk. They will often expect you to have an understanding of how to keep the information they are providing secure before you interview them – and this is key to building trust with them. To start with, it should be agreed with sources whether encrypted communication is legally, technically and practically possible (without attracting unnecessary attention). Databases with contacts or sources should be password-protected and interview notes and recordings should be stored and shared safely as mentioned earlier. For more information, check out Exposing the Invisible: The Kit, which includes articles on "How to Manage Your Sources" and "Interviews: the Human Element of Your Investigation".

Laura Tich (SheHacks_KE): As a media professional, it is your responsibility to keep your sources safe. Some of the precautions you should take are as follows: try as much as possible to avoid direct contact with your source in cases where their lives are at risk; if you need to contact them directly, use secure communication platforms such as Signal; encrypt any files you share; and use secure platforms for whistleblowers such as https://afrileaks.org/

Henk Van Ess (Journalist): You are not alone: be judicious about promising confidentiality. Keep secrets secret and read "Protecting Sources in the Digital Age".

Chris Dufour (Digital security consultant): Training, training, training. Professionals should establish an organisational training plan that they themselves employ. Part of that plan should address how to work with sources digitally, from initial contact to ongoing communication and file transfer. Try to keep things as "old school" as possible: meet in person, talk on the phone, take handwritten notes. Avoid putting sensitive documents or photos in places that can be hacked, especially when dealing with sources operating within repressive regimes whose security services do not offer the same respect for privacy as those in Europe.

Viktor Vecsei (IVPN): Journalists need to take responsibility for the safety of their sources by sharing simple best practices and guides before they start receiving sensitive information from them. Proper preparations are vital during a ‘getting to know each other’ period. They need to act with patience to avoid confusion and the development of mistrust before moving on to the information exchange step of the cooperation. For this process they need to gauge the technical skill and privacy awareness of their source and do some hand-holding to keep them from slipping up and compromising both parties.

Online security must be carefully nourished and updated day by day as your work changes. To communicate publicly is in and of itself a risk.

7. How can journalists balance a professional online presence with their digital privacy and safety?

Laura Tich (SheHacks_KE): Your safety comes first. Separate your personal life from your work life. For example, you can keep private social media accounts for your close family and friends and have a work account open to the public. Some challenges, such as trolling and harassment, might be unavoidable; find out what actions you can take in such cases. Avoid posting aspects of your personal life that could lead to physical harm (e.g. your location).

Chris Dufour (Digital security consultant): Journalists need more and better education and training on how to audit and manage their digital identities. This is not something that can be addressed once and then it's done. Online security must be carefully nourished and updated day by day as your work changes. To communicate publicly is in and of itself a risk, especially given the unknowns about who owns your data on social media services or what even constitutes your data. Your organisation should develop tested and auditable privacy protection principles for all its members so that there is clear delineation in how you report and promote your work publicly.

Nicola Nye (FastMail): An important step is to understand what security risks you need to protect yourself against. If you think your work might have made you a target for disgruntled individuals or organisations, then protecting yourself from doxxing is worthwhile: scrub your personal information from social media and check what comes up if you search for yourself. Use different pseudonyms on different sites to make it harder for a doxxer to link up that @piratequeen on Twitter is also @catspyjamas on Instagram. Using different email aliases on each site you sign up to also helps keep that information separated. If you don't want to share your regular email address with someone you're communicating with, use an alias, which makes it easy to block mail to that address in the future.

Sasha Ockenden (Tactical Tech): Ultimately, the safety of journalists, collaborators and sources has to be the top priority, particularly when investigating highly sensitive information or working in hostile environments. Another article, "Safety First", expands on this with suggestions for good practices, including risk assessment and mitigation. Of course, in reality, a journalist will sometimes have to balance various safety aspects with the need for efficiency in their pursuit of evidence. We call this the ‘Security Trade-off.’ When choosing tools, there may be a trade-off between what is useful, easy to use, or secure. The key is to understand the context you are in, and what you are gaining or giving up by using a particular tool. It is important to identify the points of greatest vulnerability, and at these points it may make sense to invest in security over functionality or usability (e.g. if a device contains sensitive information, entering a password every time rather than having it saved automatically, in case of theft). Promoting the results of an investigation is important – but depending on the context, this can also be achieved under a pseudonym or indeed a whole separate digital identity, so that the personal details of journalists or others who might be at risk are not easily accessible to the outside world.

Recommended digital security resources

Make sure to check these guides, reports and resources:

10 data journalism projects that made an impact in 2020: our ultimate COVID-19 roundup https://datajournalism.com/read/longreads/covid-19-data-journalism Wed, 23 Dec 2020 06:36:00 +0100 Andrea Abellán https://datajournalism.com/read/longreads/covid-19-data-journalism When COVID-19 began to spread in early 2020, media organisations quickly pivoted and adapted their editorial coverage to inform audiences about the global health crisis. Amid the chaos and disruption, data journalism drove the reporting for many news outlets. Journalists reported on the daily coronavirus death toll and case counts. Data teams and scientists collaborated to design one-off interactive explainers about the virus. Fact-checking outlets crowdsourced and debunked misinformation, and investigative journalists dug into medical supply chains and government spending.

Before turning the page on 2020, the DataJournalism.com team has chosen to profile some of the year’s most striking and impactful COVID-19 data journalism projects. Take a look at our top 10 picks for the year, reviewed in no particular order:

1. COVID-19: The Global Crisis - in Data. Financial Times

The reporting done by the Financial Times about the pandemic has been impressive and acknowledged by media professionals worldwide. Its data and graphics, built using D3 and updated daily, have been used as a reference for many data teams. During the November 2020 News Impact Summit on Data Journalism, John Burn-Murdoch, the Financial Times’ Chief Data Reporter, explained how the team optimises its graphics for clarity, memorability and reach. A priority was adapting those graphics to social media in order to engage and inform a mass audience. Burn-Murdoch insisted on the importance of complementing data visualisations with texts and annotations whenever needed, to ensure high-quality narratives in which text and visuals carry almost equal weight. He also emphasised the importance of connecting with readers and listening to their feedback on how the data could be improved and which angles could be further explored. The Financial Times made this content freely available. Since March, the publication has seen a boost in subscriptions.

2. At the Epicenter. What if all COVID‑19 deaths in Brazil happened in your neighborhood? - Agencia Lupa & Google News Initiative

There have been 186,764 fatalities due to COVID-19 in Brazil to date (23 December 2020). At times, the terrible toll of this virus can feel abstract and difficult to relate to for the general public. To address this, the “At The Epicenter” simulation brings home the real possibility of the virus affecting our community and loved ones. The project, run by Agencia Lupa and powered by the Google News Initiative, uses Brazil’s total number of deaths to illustrate what someone’s neighbourhood would look like if all the deaths had happened there. A user has to enter their address or enable their location to be shown a data visualisation of the deceased, represented by white dots.

The project was art-directed by Alberto Cairo, who explained to DataJournalism.com that the piece “shows that human beings have a tough time understanding numbers unless those numbers put us at the centre.” The project was published in Portuguese and in English, and the data has been updated daily since it was first published on 24 July 2020. Another version, for the United States, was published by The Washington Post months later. The team behind the project includes Vinicius Sueiro, Rodrigo Menegat, Tiago Maranhão, Natália Leal, Gilberto Scofield Jr., Simon Rogers and Marco Túlio Pires. You can access the methodology and data here.

3. A Room, a Bar and a Classroom: how the coronavirus is spread through the air - El País

Spain was one of the first and hardest-hit European countries in the pandemic. With severe regional and national lockdowns across the country, the coronavirus has posed new challenges for the way Spanish journalists report. Working remotely, along with restrictions on movement, continues to make on-the-ground reporting difficult, if not impossible, for journalists in Spain and elsewhere. But at El País, one of the country's most-read national newspapers, the data team has thrived despite the challenges. The hard work of its small team of three (Daniele Grasso, Borja Andrino and Kiko Llaneras) paid off: nine of the 50 most viewed pieces by El País in 2020 qualify as data journalism.

A room, a bar and a classroom: how the coronavirus is spread through the air is one of the publication’s most popular online pieces, generating 40 million pageviews and counting. The visualisation explains how the risk of contagion is highest in indoor spaces but can be reduced by applying all available measures to combat infection via aerosols. It provides an overview of the likelihood of infection in three everyday scenarios, based on the safety measures used and the length of exposure.

The visualisation has been widely shared by other media outlets around the globe, becoming what the team described as “a virus itself”. They credit its success to the team's mixed skillset in design, storytelling and science: a visual journalist, a science journalist and a chemist. At the News Impact Summit in November 2020, they explained its impact: “We know people have started opening their windows after reading this piece.”

3 key lessons El País' data team learned from the pandemic

1) Coding is for journalists: this skill helps the team work more efficiently and eases the flow of production from conversation to data exploration to publishing the final piece.

2) Transparency matters: It’s important to show the data. Journalists have to choose the most relevant visuals and variables to be efficient, but readers still want to see the data for themselves.

3) Practice analytical journalism: data alone is not enough. It should be accompanied by a robust analysis that embraces uncertainty and is written in a clear way. It’s important to make clear that journalists do not have all the answers.

4. How the Virus Got Out - The New York Times

The New York Times is known for producing compelling data journalism. Its coronavirus coverage was no exception. One of its first data-led coronavirus pieces came early on in the pandemic: the How the Virus Got Out simulation illustrates the travel patterns that spread the outbreak after the first cases were spotted in the capital of Hubei province in China. The outbreak, which reportedly began in a seafood market in Wuhan, led to the city becoming the first jurisdiction in the world to be placed under lockdown in late January 2020.

Jin Wu, Weiyi Cai, Derek Watkins and James Glanz were The New York Times journalists involved in producing the piece. They explained to DataJournalism.com that the main challenge for them was to decide “how to tell the story based on all these data and be clear what we know as well as what we don’t know. We read tons of preprints of the papers, then went through these papers with experts in the field to make sure the findings were solid. The story was published at the early stage of the pandemic when the world was trying to understand how the virus spread from a few isolated cases into a global pandemic. We analysed the movements of hundreds of millions of people to show why the most extensive travel restrictions to stop an outbreak in human history hadn’t been enough.” They used Python to scrape and analyse travel data, Adobe Illustrator and three.js, a WebGL library, to build the visualisation.

5. Africa’s Data Journalism Alliance Against COVID-19 - Pulitzer Center

Africa’s Data Journalism Alliance Against COVID-19, supported by the Pulitzer Center on Crisis Reporting, launched in May 2020. The initiative aimed to publish 20 high-quality journalism pieces about the pandemic's social and economic impacts on African societies.

DataJournalism.com asked Jacopo Ottaviani, Code For Africa's Chief Data Officer, to share his favourite pieces from the series. He highlights an article published in cooperation with the Kenya-based publication AfricaUncensored, which explores the roots of the pandemic in Africa through its transmission via pangolins. Another of his chosen pieces is a photo-reportage covering the impact of the virus on the education system in Kenya, published by EverydayAfrica.org.

"In Code for Africa and Wanadata, we are used to working in these sort of distributed environments. However, I would say that the main challenges we have faced are related to logistics, combining data with on the ground pieces of evidence without being able to travel around. Journalists across the continent were under lockdowns, and they had to do these journalistic works from their desks," said Jacopo.

6. The Coronavirus Simulator - The Washington Post

Data Reporter Harry Stevens is the author of “The Coronavirus Simulator”, a data visualisation that became The Washington Post’s most viewed online article ever. When the general public was unaware of the power of social distancing, Stevens managed to illustrate how important it was in slowing the spread of the coronavirus, and how it could impact our lives months later.

“There’s definitely been an emotional response to this piece. This is a very anxious time for a lot of people. But when you see that you can change the outcome of this by modifying your own behaviour, it gives you a sense of control”, Stevens explained to DataJournalism.com.

He collaborated with his colleagues at The Washington Post to simulate how the disease could spread through a number of different scenarios, including adopting social distancing practices.

The Washington Post made this article freely available to all readers and translated it into 12 languages. Many news outlets have made this conscious effort, acknowledging the life-saving role that information has for readers.

7. The #CoronaVirusFacts Alliance - Poynter Institute

Mis- and disinformation remain serious threats throughout this pandemic. Citizens worldwide needed clarification about the thousands of coronavirus rumours circulating online, a situation the World Health Organisation famously called an “infodemic”. This has forced many newsrooms to make fact-checking and debunking false information an important part of their editorial strategy.

To help, the #CoronaVirusFacts Alliance, led by the International Fact-Checking Network (IFCN) at the Poynter Institute, organised over 100 fact-checkers working in more than 70 countries. It is the largest collaborative fact-checking project to date. The database contains over 9,000 fact-checks in 40 languages and is updated daily.

Laura del Río, a member of the Alliance’s partner organisation Maldita.es, told DataJournalism.com that “thanks to the CoronaVirusFacts Alliance communication channels and database, we have quickly known if a rumour was also spreading in different countries and if some other members of the alliance had already denied it. We have even been able to prepare for the arrival of misinformation such as the Plandemic videos. It has been very relevant at a time when it is extremely important to react quickly to the avalanche of hoaxes and misinformation”.

8. The Koronamonitor - Átlátszó

With over 90% of Hungary's media controlled by the government, Átlátszó is one of the few independent investigative Hungarian news outlets to show how the government has dealt with the pandemic. To help audiences understand the magnitude of the health crisis, Átlátszó's data team created the Koronamonitor, a resource that it has updated daily since March 2020. This comprehensive collection of graphs and maps outlines the coronavirus outbreak in Hungary; the Orban government initially, and falsely, accused it of distorting the data, but it has proven very useful for Hungarian citizens. Átlátszó has made it more interactive by adding a simulator where the audience can set different parameters and see how they affect the virus forecast.

Data journalism has also been at the forefront of Átlátszó’s investigative reporting. For instance, it revealed how a relative of a politician profited from the sale of medical supplies and disinfectants, and how the government invested more resources in protecting churches and its opera house than in curbing the coronavirus. As Tamas Bodoky, executive director of the organisation, told the European Journalism Centre: “Investigative reporting on COVID-19 was in high demand since most of the Hungarian press did not cover the pressing issues during the lockdown. We have learned if you have enough courage to report on problems during a crisis, your audience will reward you.” With the success of the Koronamonitor, Átlátszó now plans to continue investing in more data-led journalism.

9. Ojo Público

Beyond reporting on Perú's daily coronavirus cases and providing readers with government lockdown rules, the team at Ojo Público, an award-winning Peruvian news outlet, has also published in-depth investigations that deserve a shout-out.

Since 2020, personal protective gear and medical equipment, such as face masks and respirators, have become essential for countries managing the pandemic. Ojo Público looked at the supply chain of these products and the monopoly within it.

Gianfranco Huamán, the journalist involved in many of these investigations, explains that “we had to be careful with the information, even if it came from official sources. In press conferences, state authorities showed figures and statistics showing the effectiveness of strategies to counteract the advance of the virus, but in hospitals and health centres, doctors and relatives told another story. Later, the government published the ‘honesty of deceased’ reports aimed at showing the true toll of the COVID-19, which somehow changed our outlook compared to other countries in the region. However, even though these figures came from health authorities and were official, we decided to report and investigate to see if the policies adopted by our authorities were effective, and in turn, verify the data they provided. I would say that the use of data to understand this disease and the pandemic was a key factor, as was the support of doctors, who helped us to understand the terms related to epidemiology much better. Initiatives from other media such as The New York Times and the Financial Times that were based on data and visualisations inspired us to replicate some of their work in our country.”

The staff working to cover the pandemic was organised into seven categories. Most recently, Ojo Público also published Infodemia, a digital book looking at how disinformation has affected the pandemic. Written in a satirical tone, the book looks at how rumours are created, and how false news related to the vaccine can spread.

10. Vaccine Bootcamp - Reuters

With governments investing heavily in COVID-19 vaccine trials and rollouts, educating the public about this matter has never been more important. This is especially true with vaccine hesitancy growing thanks to online mis- and disinformation. To better explain vaccine development, Reuters' Vaccine Bootcamp demystifies the process.

The interactive's cartoon-like design is as engaging as it is visually appealing. With data obtained from the Vaccine Centre at the London School of Hygiene and Tropical Medicine, the piece unfolds in a clean and light scrolling format with useful explainers about the different types of vaccines.

The piece was reported by Ally J. Levine and Minami Funakoshi, illustrated by Catherine Tai and animated by Adam Wiesen.

Conclusion

The pandemic has emphasised the importance of transparent and accessible data for public service journalism, helping citizens stay informed and holding governments to account. This article has touched upon some of the challenges and opportunities that have emerged during this unprecedented crisis. The cooperation between journalists, scientists, designers and developers is one example. As an industry, we have learned that this interdisciplinary approach is fundamental to current and future crisis reporting.

One theme that has defined 2020 is the growing threat of widespread mis- and disinformation, a potential danger to public health. Newsrooms have witnessed how essential it is to stay up to date with digital verification skills. We have seen data journalism's power, especially when combined with quality on-the-ground reporting and top-notch technologies. But most importantly, these data projects have shown that they haven't lost sight of the main goal: publishing relevant and accurate data that is easy for our readers to relate to and understand. This has been a recap of some of the most impactful pieces of data journalism of 2020. Admittedly, this list is not exhaustive, but we are excited to see what comes next from data journalists. Now, onwards and upwards to 2021!

Correction: A previous version of this article stated Harry Stevens of The Washington Post collaborated with Lauren Gardner, an associate professor at Johns Hopkins Whiting School of Engineering, and her team to simulate how the disease could spread through a number of different scenarios, including adopting social distancing practices. This is incorrect. In fact, he collaborated with editors and the data team at The Washington Post. The article has been updated to reflect this.

Harnessing Wikipedia's superpowers for journalism https://datajournalism.com/read/longreads/harnessing-wikipedias-superpowers-for-journalism Wed, 02 Dec 2020 07:00:00 +0100 Monika Sengul-Jones https://datajournalism.com/read/longreads/harnessing-wikipedias-superpowers-for-journalism Orientations to Wikipedia often begin with its enormity. And it is enormous. The encyclopedia will be 20 years old in January 2021 and has more than 53 million articles in 314 languages. Six million are in English. According to Alexa.com, Wikipedia is the 8th most-visited web domain in the United States, and the 13th globally; it’s the only non-profit in the top-100 domains. In November 2020, more than 1.7 billion unique devices from around the world accessed Wikipedia articles. Average monthly pageviews surpass 20 billion.

Beyond reach, there’s the data. All data on and about all Wikipedias—from pageview statistics and most-frequently cited references to every version ever written and all the editors who have ever contributed—is freely available. Entire version histories are available at dumps.wikimedia.org.

Twitter bots that share the text of Wikipedia edits made from high-impact IP addresses, such as the White House, covered by the @whitehouseedits bot pictured above, can help data journalists track malfeasance. But there’s evidence the bots can be manipulated. Image credit: Twitter @Whitehousedits

Thanks to free and open access to billions of human- and machine-readable data points, corporations and research centres have been leveraging Wikipedia for research for years. Benjamin Mako Hill, assistant professor of communication at the University of Washington, and Aaron Shaw, associate professor of communication at Northwestern University, describe Wikipedia as the “most important laboratory for social scientific and computing research in history” in their chapter in "Wikipedia@20", a new book on Wikipedia published by MIT Press, edited by Joseph Reagle and Jackie Koerner.

“Wikipedia has become part of the mainstream of every social and computational research field we know of,” Hill and Shaw write. Google’s knowledge graph and smart AI technologies, such as Amazon’s Alexa and Google Home, are based on metadata from Wikimedia projects, of which Wikipedia is the best-known. Significant for data journalists is how Wikipedia’s influence has already surpassed clicks to article pages; in a way, the internet is already Wikipedia’s world, and we’re just living in it.

But journalists know well that ubiquity shouldn’t stand in for universality. We should be mindful that indiscriminate use of “big data” without acknowledging context reproduces what Joy Buolamwini, founder of the Algorithmic Justice League, calls the “coded gaze” of white data. Safiya Umoja Noble, a critical information studies expert and associate professor at UCLA, challenges the acceptance of invisible values that normalise algorithmic hierarchies.

Internet search results, which often prioritise Wikipedia articles in addition to using Wikipedia’s infobox data or structured data in sidebars, “feign impartiality and objectivity in the process of displaying results,” Noble writes in "Algorithms of Oppression: How Search Engines Reinforce Racism".

Systemic biases on Wikipedia, including well-documented “gaps” in coverage, readership, and sourcing, are cause for pause. Globally, volunteer contributors are predominately white males from the northern hemisphere. On English Wikipedia, less than 20% of editors self-identify as female. Asymmetries in participation have impacted the editorial processes and content. Editors who self-identify as women often perform “emotional work” to justify their contributions. Women and nonbinary users on Wikipedia may encounter hostile, violent language and some have experienced harassment and doxxing. Then there are the asymmetries in the breadth and depth of coverage; only approximately 17% of biographies on English Wikipedia are about women.

How to contribute to Wikipedia

Anyone can edit Wikipedia, but there is an editorial pecking order and policies to keep in mind. Tips for success:

  1. Assuming you have created an account, be sure to include a bio on your user page (you don't need to use your real name, but you can).

  2. Improve existing articles to begin; you can create new articles once your account is four days old and you’ve made ten edits.

  3. Include verifiable citations to secondary sources for any new claims, or for claims where a citation is needed.

  4. Be aware of Wikipedia’s guidelines on conflicts of interest.

Beyond this, there are many tutorials and videos with various tips and tricks. Among them, this is a useful high-level summary, while an editing tutorial hosted by the Wikimedia Foundation walks you through nitty-gritty basics.

With this glut of imperfect or missing data, what’s a data journalist to do? Journalists doing internet research might consider that they are already knee-deep in a minefield of constraints.

“The reality for journalists working on the internet is fraught,” said Hill. “Most internet data sets are controlled by commercial companies. That means there’s never going to be a full data set and what’s available has been—or is being—manipulated. Wikipedia is different. It’s free, it’s accessible, and it’s from a public service organisation.” Like any institution, as Catherine D’Ignazio has pointed out in this publication, context may be hard to find. On Wikipedia, that’s often due to the decentralised organisation of open source projects; volunteers come and go, rather than intentional obfuscation.

Nevertheless, Noam Cohen, a journalist for Wired and The New York Times who has written about Wikipedia for nearly two decades, said in a phone interview that journalists should—if they are not already—use Wikipedia’s data, including pageviews and the layers of information found in article pages. But Cohen cautions journalists not to let Wikipedia’s decisions on coverage replace news judgement. “In journalism, word length is often a sign of importance,” Cohen said. “That’s not the case on Wikipedia, there are articles about "The Simpsons" or characters on "Lost" that are longer than articles about important women scientists or philosophers. But these trends don’t mean there are no rules. There are; the information is changing.”

To leverage Wikipedia’s superpowers for data journalism, it’s best to climb into the belly of the beast.

Last year, Cohen’s editor asked him to write about why his Wikipedia biography—which he did not create, as there are guidelines barring “conflict of interest editing”—was deleted. Cohen dug in and discovered it was due to “sock-puppetry”, shorthand for editors who use more than one account without disclosure. Later, another editor restored Cohen’s biography.

Stories like this may give journalists discomfort about the contingencies of the online encyclopedia, and any data sets therein. And for as long as there’s been Wikipedia, there have been editors and professors warning us to stay away. But Cohen suggests thinking otherwise. “The fact that information is slowly being changed and is always saved is Wikipedia’s superpower,” said Cohen. To leverage Wikipedia’s superpowers for data journalism, it’s best to climb into the belly of the beast.

Understand how Wikipedia’s authority works

While one might reasonably guess that the Wikimedia Foundation manages editorial oversight, that’s not the case. All content decisions are made by volunteers, who also design and run bots to do tedious, repetitive tasks such as fixing redirects or reverting vandalism, as ClueBot_NG does. The Wikipedia community has developed a number of policies and guidelines to govern editing, including a rule about verifiability and a blacklist of publications not allowed to be cited on Wikipedia. Blacklisted publications include spam and publications that do not fact-check or that circulate conspiracy theories.

In 2017, Katherine Maher, executive director of the Wikimedia Foundation, spoke with The Guardian about the volunteer community’s decision to blacklist The Daily Mail as a reliable source. “It’s amazing [Wikipedia] works in practice,” she said, motioning to a concept that academics have called peer production or crowdsourcing. “Because in theory it is a total disaster.” Wikipedia works in practice, not in theory: it’s a popular idiom among Wikipedians, as Brian Keegan writes in Wikipedia@20. And it does suggest there’s something magical about the project, where successful shared editing of a single document was happening long before Google Docs.

There is a logic to Wikipedia—no magic. The free encyclopedia launched in 2001 for “anyone” to edit. This was not an explicit democratic effort to engage portions of the public who have historically been left out of structures of power, though some have championed Wikipedia for getting close to achieving this. Rather, the effort was a wildcard reversal of Wikipedia’s failed predecessor, Nupedia, which was designed as a free, peer-reviewed encyclopedia edited by recognised experts. When editing shifted from experts to “anyone”—that is, people who happened to have computers, internet connections, a penchant for online debate and familiarity with MediaWiki, as opposed to busy academic experts—contributions flowed faster.

Wikipedia was also a product of its time. It was one of many online encyclopedia projects in the early 2000s. Under Section 230 of the 1996 Communications Decency Act in the United States, Wikipedia, like other platforms then and now, has been immune from legal liability for its contents. Section 230 also gives platforms the legal blessing to govern as they see fit. Jimmy Wales, co-founder of Wikipedia, set up the Wikimedia Foundation to oversee the project and its sister platforms in 2003, and they have remained volunteer-run. The Wikimedia Foundation has an endowment of more than $64 million, with tech titans such as Amazon pledging millions, and the Foundation supports projects by volunteers and affiliates. English Wikipedia has snowballed in popularity on a commercial internet. Google, for instance, prioritises Wikipedia articles in search results—treating them like “gospel”, said Cohen—while the convenience, currency, and comprehensibility of Wikipedia attract regular readers.

Using pageviews to tell a story

Data journalists can find the granular level of insight about pageviews handy for storytelling. Viewers of Wikipedia come from around the world. The Wikimedia Foundation does not track individual data, but it does count devices across pages, including what type of device—mobile app, mobile browser, or desktop browser—is used to access them. This can give journalists insight into topical and regional access trends.

More radically, pageviews can reveal kernels of stories yet to be broken. Let’s simulate research using pageviews for a story on the rising COVID-19 case count in light of concerns about circulation of misinformation and disinformation on the virus. Digging into pageview data on COVID-19 articles in English Wikipedia can help to tell this story, and others like it.

In spring 2020, as unprecedented economic and social changes unfolded across the globe, journalists were at the forefront of providing coverage on this moment. Meanwhile, conspiracy theories were gaining visibility in social media groups, while edit counts and information queries about all articles related to COVID-19 were at their highest to date.

By mid-November 2020, a new trend. Positive cases of COVID-19 skyrocketed around the globe. Several European countries and U.S. states re-introduced lockdown measures to slow the spread of the virus. But Wikipedia pageviews for articles about COVID-19 were not rising; in fact, they were lower than earlier in the year. Meanwhile, pageviews on the presidential candidates and their families were cresting with the U.S. election.

A line graph above shows a pageview analysis from Nov 2019 to Oct. 2020 (x axis) depicting pageviews by the thousands (y axis) of four article pages: Donald_Trump, Coronavirus_disease_2019, Joe_Biden, and George_Floyd. Source: pageviews.toolforge.org/

Did election coverage distract readers from the pandemic? Spikes in readership on Wikipedia are often the consequence of other media attention or events, which could help to explain the peaks in views for George Floyd, Donald Trump and Joe Biden. Koerner, who trained as a social scientist, cautions journalists not to make quick deductions about readers' motivations from high-level pageview data. “It’s tricky to say that pageviews are indicative of what people are thinking,” she said. For more granularity, journalists can compare sets of pageviews using the browser-optimised pageview visualisation tool.
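
For journalists who would rather pull the numbers programmatically than through the browser tool, the Wikimedia Pageviews REST API exposes the same data. Below is a minimal Python sketch using only the standard library; the article title and date range are example values, and "user" filters out known bots and spiders.

    import json
    import urllib.request

    # Daily pageviews for one article, as JSON, from the Wikimedia REST API.
    article = "Coronavirus_disease_2019"
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           f"en.wikipedia/all-access/user/{article}/daily/20201001/20201122")
    req = urllib.request.Request(url, headers={"User-Agent": "pageview-demo"})
    with urllib.request.urlopen(req) as resp:
        items = json.load(resp)["items"]

    for item in items:
        # Timestamps arrive as YYYYMMDDHH strings; print date and view count.
        print(item["timestamp"][:8], item["views"])

Swapping in a second article title and plotting both series reproduces the kind of comparison shown in the graphs here.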

Above is a blue bar graph showing pageviews of the Symptoms of COVID-19 article rising by the hundreds (y axis) from October to 22 November 2020 (x axis), an increase of hundreds of views in under two months.

Meanwhile, pageviews of the general COVID-19 article may have peaked in the spring, but data journalists can note that pageviews of the article “Symptoms of the coronavirus” rose in October, as depicted above, before case numbers peaked. Incidentally, this correlation could lend credence to the suggestion by a team of epidemiologists in 2014 that high pageview data on influenza-related Wikipedia articles could be used to predict the percentage of Americans with influenza. While it remains to be seen if pageviews can predict illness spikes, the data can offer a wide lens on the zeitgeist.

Above is a list of the top 10 most viewed articles in 2019, in order of popularity, with the number of edits and editors for each. Avengers: Endgame, Deaths in 2019, Wikipedia, Ted Bundy, Freddie Mercury, Chernobyl Disaster, and List of Highest-grossing films are the top seven. Wikimedia Statistics provides high-level data on trends in pageviews, including top-viewed article pages. The data was accessed at Pageviews.toolforge.org.

Behind the scenes

With approximately 300 edits per minute—which is soothing to listen to—Wikipedia is always changing. You may already have edited Wikipedia, the blue “edit” tab is on almost every article page. There are more than 1.2 billion speakers of English and over 40 million Wikipedia accounts.

Maybe you made an account and your changes stuck. Maybe you tried to write an article, only to have it deleted. Or maybe you wondered about how easy it is to add profanity to an article on a popular topic—only to realise that the “Edit” tab is missing. Rather, there’s a lock. Or possibly, a gold star.

Locks. Gold stars. Deletions. These are subtle signs and signals that can help you understand how the editing community works.

Above is a labelled diagram of the parts of a Wikipedia page using the example of the Black Lives Matter article. While every article page has these features, I've chosen to label the Black Lives Matter article because it is an extensive composite of the movement's history, it's been peer-reviewed by editors and is locked, which makes vandalism more difficult.

Wikipedia’s “best” are marked with green crosses and gold stars; these are Good and Featured content, which have undergone “peer review.” They are the minority among Wikipedia's millions: just 0.1%.

Meanwhile, the active editorial community on English Wikipedia numbers about 4,000 editors a month. Fewer are administrators. As of November 2020, approximately 1,100 users have successfully undergone a “request for adminship” and have been granted additional technical privileges, including the ability to delete and/or protect pages. Non-administrative editors, however, may patrol new pages and roll back recent changes.

Wikipedia’s editorial judgement can spark justified outrage.

Journalist Stephen Harrison covered this recently in his Slate article on the Theresa Greenfield biography. Archivists and indigenous and feminist communities have noted that the reliable source guidelines exclude oral histories, ephemera, and special collections. I am currently co-leading an Art+Feminism research project on marginalised communities and reliable source guidelines, funded by WikiCred, which supports research, software projects and Wikimedia events on information reliability and credibility. Data journalists can follow debates on-wiki, and note what is absent, by looking at article Talk and View history tabs, and on notice boards for deletion and reliable sources.

At the same time, there’s plenty to be discovered with Wikipedia. Article features such as wikilinks, citations, and categories can help data journalists quickly access a living repository of information.

Above is a labelled diagram showing the wikilinks, citations, and categories using the example of the Black Lives Matter article. On Wikipedia, hyperlinks within articles generally lead to other Wikipedia articles, citations are footnotes with references listed at the end of an article. Categories may help journalists find other articles.

In 2011, an editor began a list documenting people killed by law enforcement in the United States, both on duty and off duty. Since 2015, the annual average number of justifiable homicides reported was estimated to be near 930. Tables about gun violence have been collected on Wikipedia for nearly a decade.

Above is a diagram of portions of the article listing killings by law enforcement officers in the United States, including a monthly table from pre-2009 to 2020. This Wikipedia list has amassed data sets from hundreds of sources that verify the killing of humans by law enforcement officers. Between 930 and 1,240 people are killed by police annually in the United States.

The integrity of this list was brought to my attention by Jennifer 8. Lee, a former New York Times journalist. She expressed surprise that there are not more examples of journalists using Wikipedia’s data. Lee would know, she co-founded the U.S.-based Credibility Coalition and MisinfoCon, and supports WikiCred, which addresses credibility in online information and includes Wikipedians, technologists, and journalists.

“[These] are fascinating and useful,” said Lee. “Not automated, this is a hand-written list. It’s all in one place. This is useful for journalists and those of us in the credibility sphere to use it for research.”

Ed Erhart, who works with the Wikimedia Foundation’s audience engagement team, suggests that articles can be not only a repository but also fodder for coverage. “I like to say that there is a story in every Wikipedia article,” he wrote by email, drawing my attention to a Featured article about a small town, Arlington, Washington. “Who wrote it? Where are they from? What motivated them? The talk and history tabs on Wikipedia's pages can be the starting point for some truly unique takes on local places and issues.”

Catching malfeasance

Data journalists can follow edits to track corporate or governmental malfeasance. Article pages about companies or politicians can be scrubbed to omit negative information, though editors are required to disclose conflicts of interest on their user page or on the article's Talk page.

Users who edit Wikipedia as a part of their paid work are required to disclose conflicts of interest. This image shows an example of a user who has done so: John P. Sadowski edits articles on biomedical topics, including articles related to COVID-19, using resources from his employer, the U.S. Centers for Disease Control and Prevention (CDC).

Not all contributors disclose. Kaylea Champion, a doctoral student at the University of Washington, led a large-scale research project on IP editing and discovered systematic deletions from articles about mining. Anonymous editors removed information about environmental contamination and abuse. Champion and her co-authors traced the IP addresses that deleted the incriminating information to the headquarters of the mining companies.

Journalists can do their own large-scale reconstructions of edit histories using data from Wikipedia’s data dump, or manually browse pages of interest. Historical contributions can all be accessed, even if they are not visible on the live page. Journalists can also reach out to editors by writing a note on their Talk page with information on how to connect.
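
For a single article, there is no need to download a full dump: the public MediaWiki API returns revision metadata directly. A minimal Python sketch, with the article title as an example value:

    import json
    import urllib.request
    from urllib.parse import urlencode

    # Fetch the 50 most recent revisions of one article, with editor
    # names, timestamps and edit summaries.
    params = urlencode({
        "action": "query",
        "prop": "revisions",
        "titles": "Black Lives Matter",
        "rvprop": "timestamp|user|comment",
        "rvlimit": 50,
        "format": "json",
    })
    req = urllib.request.Request(
        "https://en.wikipedia.org/w/api.php?" + params,
        headers={"User-Agent": "edit-history-demo"},
    )
    with urllib.request.urlopen(req) as resp:
        pages = json.load(resp)["query"]["pages"]

    for page in pages.values():
        for rev in page["revisions"]:
            print(rev["timestamp"], rev["user"], rev.get("comment", ""))

The same query can be paged backwards through an article's entire history, which is how larger reconstructions start.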

The GIF below demonstrates how to access View History and compare versions of the Black Lives Matter article page, using the Compare Version History tool. Be sure to use the View History tab to compare version histories. You can also click on a timestamp to view that version of the article in full.

Tracking with bots

Bots can help with tracking. In 2014, a number of bots were launched by volunteers to track edits made from specific IP ranges and post the findings to Twitter. Parliament WikiEdits, one of the first, still regularly tweets edits made from Parliamentary IPs in the UK. Similar efforts have been available for the White House, the European Union, the Norwegian Parliament, the German Parliament, Goldman Sachs and the Monsanto Company, though not all are up to date.

For data journalists interested in setting up a bot that tweets about anonymous Wikipedia edits from particular IP address ranges in their beat, the code is available from Ed Summers on GitHub under a CC0 license.
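
The core of such a bot can be sketched in a few lines without any framework: the MediaWiki API lists recent anonymous edits, whose usernames are IP addresses that can be matched against a range of interest. A Python sketch; the IP range below is purely illustrative.

    import ipaddress
    import json
    import urllib.request
    from urllib.parse import urlencode

    # List recent anonymous edits; for these, "user" is an IP address.
    params = urlencode({
        "action": "query",
        "list": "recentchanges",
        "rcshow": "anon",
        "rcprop": "title|user|timestamp",
        "rclimit": 50,
        "format": "json",
    })
    req = urllib.request.Request(
        "https://en.wikipedia.org/w/api.php?" + params,
        headers={"User-Agent": "anon-edit-demo"},
    )
    with urllib.request.urlopen(req) as resp:
        changes = json.load(resp)["query"]["recentchanges"]

    watched = ipaddress.ip_network("203.0.113.0/24")  # example range only
    for change in changes:
        try:
            if ipaddress.ip_address(change["user"]) in watched:
                print(change["timestamp"], change["user"], change["title"])
        except ValueError:
            pass  # skip any username that is not a plain IP address

A real bot would poll this endpoint on a schedule and post the matches, which is essentially what the GitHub code mentioned above automates.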

Data journalists should weigh the public benefit of amplifying hate speech, harassment, or vandalism, which could be a form of coded language, with reporting.

Pitfalls to avoid: steering clear of media manipulation

Summers created @CongressEdits in 2014, which tweeted IP contributions from U.S. Capitol computers. The Wikipedian reported that “Twitter-addicted journalists” were soon mining the bots for story ideas, some of which did reveal manipulation, such as an attempt to water down the entry on CIA torture. @CongressEdits amassed a growing audience. Things came to a head in 2018. A former Democratic staffer (who was later arrested) with access to the U.S. Capitol computers inserted personal information into Wikipedia articles about Republican members of the Senate Judiciary Committee. The Twitter account automatically shared those details with its large following. Twitter banned the bot as a result.

People can intentionally game the editorial system or interconnections between Wikipedia and other social media platforms. Data journalists should weigh the public benefit of amplifying hate speech, harassment, or vandalism, which could be a form of coded language, with reporting. “Why are people editing articles to say that the [mainstream political party] is [name of radical, violent party]? They want the screenshot,” Cohen remarked. “The best way to get a lie into [the] mainstream is to edit an article, let Google pick it up, and get reporting on it. It’s probably a thrill to plant them.”

Furthermore, Wikipedia has no “real name” policy for editors. Some choose to disclose personal details on user pages, which can help gain the confidence of other editors, but this is not required. Thus, manipulators can mimic the behaviour patterns of a group to blend in.

Joan Donovan, director of Technology and Social Change at Harvard Kennedy School’s Shorenstein Center, calls this a “butterfly attack.” Once the fakes are indistinguishable to outsiders from legitimate accounts, the manipulators push contentious issues to divide and delegitimise the group. Be mindful that you are not also falling for a “butterfly attack”—or perpetuating one by accidentally characterising editors as occupying one particular position over another. Instead, get to know the communities behind the data to minimise harm.

If you discover vandalism or hate speech in a page history, consider the impact of covering material that has since disappeared from the live page. Be mindful of the extent to which an effort at public service can double as publicity or exposure for people sympathetic to fringe ideologies or violence. Reporters who stumble across data on hate speech might report on it in aggregate, without identifying particular details, to minimise harm.

Pro tips for navigating Wikipedia:

  • Get to know Wikipedia’s editorial process and community before reporting on hate speech or harassment

  • Strongly consider the newsworthiness of articles that might give publicity to fringe ideologies

  • Use data in aggregate to avoid revealing details

Circular reporting

In 2007, The Independent published an article on Sacha Baron Cohen that included a line saying he had previously worked as an investment banker. Days earlier, the unverified claim had appeared on Wikipedia. Later, The Independent’s article became the citation for the erroneous claim.

None of it was true. Wikipedia editors call incidents like this “citogenesis”, or circular reporting. There is even a Wikipedia article that compiles known instances. The Techdebug blog traced the Baron Cohen example, with the good advice to “pay attention to timelines” when reviewing the sources of claims on Wikipedia. When using facts from Wikipedia, trust but verify.

With close attention to detail and context, data journalists can use Wikipedia’s trove of data to elucidate stories of the digital landscape. “Wikipedia is more than the sum of its parts,” said Cohen. “Random encounters are often more compelling than the articles themselves. The search for information resembles a walk through an overbuilt quarter of an ancient capital. You circle around topics on a path that appears to be shifting. Ultimately, the journey ends and you are not sure how you got there.”

Monika Sengul-Jones, PhD, is a freelance researcher, writer and expert on digital cultures and media industries. She was the OCLC Wikipedian-in-Residence in 2018-19. In 2020, she is co-leading Reading Together: Reliable Sources and Multilingual Communities, an Art+Feminism project on reliable sources and marginalised communities funded by WikiCred. @monikajones, www.monikasjones.com

Thanks to Mohammed Sadat Abdulai (Art+Feminism, Wikimedia Deutschland), Ahmed Median (Hacks/Hackers) and Kevin Payravi (WikiCred, Wikimedia D.C.) for taking time to interview with me for background research for this story.

Own your newsfeed, own your data https://datajournalism.com/read/longreads/own-your-newsfeed-own-your-data Tue, 17 Nov 2020 07:30:00 +0100 George Anadiotis https://datajournalism.com/read/longreads/own-your-newsfeed-own-your-data We all have things we care about and follow. Whether it's sports, arts, technology, from the mainstream to the obscure, we gravitate around them. Over time, we tend to both specialise, accumulating knowledge in specific sub-domains, and expand, jumping to adjacent topics or adding new ones to our list. Over time, we all become some kind of expert in something.

This is also true when doing research. Many journalists, for example, are not really experts in the domains their research takes them to. Gradually, however, they have -- hopefully -- developed the ability to start from scratch, identify information sources, fact-check them, weigh them against each other, and use their judgement to form conclusions and opinions.

This process of acquiring "temporary expertise" is particularly relevant at times like these. Unfolding situations in previously unknown domains like epidemiology, which COVID-19 has brought into the limelight, call for an organised approach to news consumption to cope with the volume, variety, and velocity of the information thrust upon us.

The data-information-knowledge-wisdom (DIKW) hierarchy as a pyramid to manage knowledge. Reproduced with permission from Tedeschi (2019). Source: Researchgate.net

Essentially, we're looking at big data on the individual level. The quest to go from data to information and ascertain some kind of knowledge out of the process calls for a structured approach to news consumption and research.

Most people, however, don't systematically keep track of the news items they consume and the sources they get them from. Relying on search engines and social media not only to find and consume information but also to store and share it is problematic for a number of reasons.

Besides data sovereignty, filter bubbles and the like, functionality-wise, social media are not well suited for knowledge management. They lack even basic features such as categorisation and search. Search engines are somewhat better in that department, but you still have to rely on a third party to rank information for you, and over time, it becomes harder and harder to locate it.

So what's a data journalist, or just the average person who wants to stay on top of the news, to do?

We'll describe a structured approach to news consumption and management. We will outline the principles, and show how they work using specific tools. The main idea is to use standards and techniques that ensure interoperability, so you can implement the principles irrespective of tools.

Own the newsfeed: why, and how, to organise and aggregate all your news in one place

Despite their shortcomings in terms of knowledge management, social media offer important benefits too: curation and commentary. We rely on people we follow to provide curated news, as well as their own views and comments, because they can add value to the news. In other cases, however, we'd just like to have straight news, right from the source.

In both scenarios, we select and categorise our sources, whether explicitly or not. If you’re interested in Arts, Politics, and Technology, you probably have certain sources you regularly follow on those topics. The first step to an organised knowledge management process is making a list of topics you are interested in, and sources you follow for each of those.

Step 1: Make a handwritten list of the topics and sources you want to keep up with.

Beyond an exercise in self-discipline, this has very real ramifications. It can help organise newsfeeds, bringing order to chaos. One way some people do this today is by adding sources they follow on Twitter to Lists. Twitter, to its credit, is the only social medium that offers this. The idea, however, is an old one, going back to RSS.

RSS stands for “Really Simple Syndication” and traces back to a format Netscape introduced in 1999 (originally as “RDF Site Summary”). Similar to social media, RSS sources offer their content as a feed. The difference is that users are in control: they decide which sources to subscribe to, and because RSS is standardised, they can subscribe to anything and take their subscriptions with them.

RSS is a standard supported by most web sites, enabling compatible reader software to get notifications as soon as new content is available. Notifications include summaries and articles that pique one's interest can be retrieved in full and read within the reader software.
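In Python, the feedparser library handles the subscription mechanics in a few lines. A minimal sketch -- the feed URL is a placeholder, since most sites expose theirs at /feed or /rss:

    import feedparser  # pip install feedparser

    feed = feedparser.parse("https://example.com/feed.xml")  # placeholder feed URL
    print(feed.feed.get("title", "untitled feed"))
    for entry in feed.entries[:5]:
        # Each entry carries a title, a link and usually a publication date.
        print(entry.get("published", "?"), "|", entry.title, "|", entry.link)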

RSS is a better way of monitoring sources, because it is standard and offers benefits such as storage. This is why it’s better to add a source’s RSS feed rather than its Twitter account, if you have the choice. Twitter is good in cases where you want to follow individuals without a permanent home for their publications, for example.

A list of articles in an RSS reader, with preview images and titles.

A shot of a full article view.

Users can subscribe to sites they want to follow, and they can also organise their subscriptions in Folders. So if I want to keep up with my favourite art critic, the literary section of the paper I read, and be in the know about upcoming events in my local museum, I can subscribe to their pages, and keep them all in a Folder called Arts.

Organising RSS subscriptions in folders helps keep track of things.

How to subscribe to keep up with your favourite sources on the web using an RSS reader.

Another standard called OPML lets users export and import their Folders and subscriptions across RSS readers. OPML stands for Outline Processor Markup Language. As far as RSS goes, conventional wisdom seems to be that Google killed RSS when it shut down Google Reader, its own RSS reader. This is not true. RSS is alive and kicking, and all it takes to use it is finding a reader that works for you.

Everywhere you go, you can take your favorite sources with you, using OPML.
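Under the hood, an OPML file is just a small XML outline: one element per Folder, with nested elements per feed. A sketch of generating one by hand in Python, with placeholder names and URLs:

    import xml.etree.ElementTree as ET

    # Build a minimal OPML export: an "Arts" folder containing one feed.
    opml = ET.Element("opml", version="2.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = "My subscriptions"
    body = ET.SubElement(opml, "body")
    arts = ET.SubElement(body, "outline", text="Arts", title="Arts")
    ET.SubElement(arts, "outline", type="rss", text="Example Art Blog",
                  xmlUrl="https://example.com/art/feed.xml")  # placeholder feed
    ET.ElementTree(opml).write("subscriptions.opml", encoding="utf-8",
                               xml_declaration=True)

In practice you rarely write these by hand -- you export from one reader and import into another -- but it shows how little lock-in the format allows.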

The idea of structuring your go-to sources in Topics via Folders can also be applied with Twitter lists -- minus the portability aspect. Unfortunately, this means you will have to log in to both Twitter and your RSS reader to keep up with your Arts sources. Fortunately, there is a hack for that.

How to create a Twitter List.

Some RSS readers let users import and view Twitter sources (plus Facebook Pages) as RSS Feeds, too. So you can curate your Topics independently, and still view them all in one place. This is invaluable. Not only does it let you use the RSS Reader as an inbox for tweets you would otherwise miss, but it lets you use other functionality such as search and highlighting.

If your RSS Reader supports connecting to social media, your news reading will be unified and upgraded

Trust the process: why a structured approach to news reading is a good idea, and how to make it work

Using an RSS reader and doing some thinking about the sources you care about and how to classify them is a solid first step. Now what? That does not really solve the "read later" issue. When reading news, we typically scroll past some items, read some fully, and may want to return to others later. To accommodate this workflow, we need more than Folders: we need Categories.

Folders are a great tool to organise your subscriptions, but they are somewhat crude. A Folder shows all new items for the subscriptions it contains, but it won’t let you categorise single items -- useful if, for example, you want to keep custom collections of items based on ad-hoc criteria, as opposed to grouping them by the RSS source they come from. This is where Categories come in.

Although most readers offer some "Save for later", “Star” or similar functionality, we advise against using it. Doing so cancels out the effort of sorting sources into Categories by dropping everything into one bucket again. Plus, it misses the distinction between "I want to read this later" and "I've read it and it's worth saving".

"Starring" or "saving for later" is not a very good practice. Items you tag this way all end up in one bucket and you typically forget what they were about.

Saving items per Category, and introducing at least two groups of items per Category, is a better approach. So if you have an "Arts" Folder, you need to create an "Arts - To Read" Category plus an "Arts - Save" Category. Luckily, RSS readers provide ways to do this. Inoreader calls these Categories Tags, Feedly calls them Boards. No matter.

Another Grouping you may want to add for your Categories is an Inbox. This will come in handy if your Reader supports search, enabling you to actively filter your subscriptions for specific keywords. Directing the results of your active searches to an Inbox for each Category helps you stay on top of specific things you care about.

As a rule, you should think about actions you want to perform with news items that pique your interest and create a Category for each action. This way you will know what is where at any given point in time. That said, however, it's still up to you to actually follow up on your intent, and read or save your news items.

You should create your own Categories. Define a list of actions you want to take for your items, such as marking for reading or saving, and create one Category per action. Classifying your items into the appropriate Categories is an intentional way of knowing where each item belongs, and what to do with it.

Some advanced Readers can help there by automating things for you -- for example, by letting you define rules that trigger when you put items in groupings and then perform actions. So if you put an item in your "Arts - Save" grouping, the item will be automatically saved. This brings us to the topic of saving items. Before that, one last point.

Sophisticated RSS Readers offer functionality such as Rules. This will enable you to automate your workflow, so when you classify an item into a Category, the corresponding action can be taken by the Reader

Your RSS Reader is your news inbox. Fine-grained and powerful as it is, you don't want items hanging around your inbox for permanent storage. There are other tools to help with that. Once you've made sure items are properly saved for the long term where they should be, you should clean your Groupings from time to time. Otherwise, items will start piling up, and you will lose track.

Categories should be cleaned from time to time. They are not meant for permanent storage, but rather as transit spaces that serve to perform the required actions on your items

Own the data: it don't mean a thing if it ain't got that save button

Getting your incoming news sorted is only half the story. The other half is being able to archive it in a way that enables you to find what you're looking for. Many people these days use note-taking applications like Evernote or OneNote to do this. Their main benefits are integration and full-text search.

Many Readers support note-taking applications, making saving items a one-click action. In addition, items saved in those applications are stored in their entirety, which means you can retrieve them by searching for any word contained in their body.

Popular note-taking applications are integrated into many RSS Readers, enabling you to save items directly.

There are also some serious drawbacks, however -- lock-in and opacity. These applications make it hard to import and export items saved in them in a portable format. In addition, you may get access to the content, but the source is obscured: the link to the original item becomes a second-class citizen.

Note-taking applications like Evernote help save items in their entirety, but their proprietary format and poor folder organization capabilities make them less than ideal as your primary storage

The alternative is to use something considered rather passe these days: Bookmarks. Bookmarks are a standard way of saving links. All browsers have integrated bookmark functionality, and the format for saving bookmarks is standardised, making it easy to import/export between services.

Storing bookmarks goes beyond links: additional information such as title, tags, comments, date added/modified is supported. What’s more, since anything can have a URL, from a file to an image to a web page, anything can be saved as a bookmark.

The most obvious issue with Bookmarks is the fact that they are local to each browser. However, there are ways around this. One way is to use browser sync services. Browsers today enable users to create accounts. Using them, users can save all their bookmarks across devices in a central location, if they always use the same browser, and are logged in.

Third-party, standards-based bookmark services also exist. Delicious, acquired by Pinboard, was the most well-known example of such a service. Diigo has followed in its footsteps; BookmarkOS and Raindrop are others. Each has strengths and weaknesses, but their core offering is similar: save anything, anywhere (via browser extensions or mobile apps), annotate with text and tags, store in a standard format.

Bookmarking applications typically do not store the full content of items, but they offer portability and a better classification structure.

To get the best of both worlds, saving items in both a note-taking application AND a bookmark service is recommended (though Raindrop may be able to cover both bases). Bookmarks and notes can also be organised in Folders. In order to have a consistent way of classifying items across systems, the Folder structure created for reading should be replicated in your item saving application(s) of choice.

Less mainstream applications may be harder to integrate and require you to open an item outside of the RSS Reader to save it.

If you are a bit savvy though, you may find a way.

A word of caution regarding Folder structure. Bookmarking allows users to create arbitrarily deep Folder hierarchies. For example, you can have a "Painting" Folder nested within "Arts". As a rule, note-taking applications do not. They only support a one-level structure -- no sub-Folders. If you use them, you will have to lump your otherwise fine-grained hierarchies in big buckets.

Again, some Readers offer automation, so that putting an item in a Grouping, and creating a rule for it, will save it in corresponding Folders in your Storage application(s) of choice.

Using RSS Rules, you can save your items to multiple back ends, as well as perform other actions. If you want to save your items as Bookmarks while also storing the full text, this is one way to do it.

Annotate, save, share, repeat: adding your personal touch, sharing with the world

As great as RSS Readers may be, some items will always come your way via serendipitous browsing, or social media finds. That's fine, as long as they all end up in the same place - your Storage application(s) of choice. To achieve this, you need to:

  1. Make your Storage application(s) easy to access
  2. Intentionally open items you want to save
  3. Ideally, annotate items you save

Making your Storage application(s) easy to access typically comes down to a browser extension. If your application(s) offers browser extensions, this makes it super easy to store anything at the click of a button. Mobile apps help too, if you don’t want to use a mobile browser. Having browser extensions and mobile apps should be among the criteria for choosing Storage application(s) - or any application really.

Whatever you are using to browse news besides your RSS Reader, whether it's a browser or a native mobile application, there is always a way of getting the link to the item you are interested in. If you can do that, it means you can also open that link in a browser, and use your Storage application browser extension to save the item. Or you should be able to use the Share functionality on your mobile app to either send directly to your Storage application(s), or send via email.

Another word of warning here. Most social media offer some kind of bookmarking / save-for-later functionality. Don’t use it. It will lock you in (you can only save items you come across on that platform) and you will lose track, as navigating your saved items is chaotic.

If you are in a mobile application for example, you can store an item directly via sharing on your Storage application, or even email. Many applications accept incoming items via email.

Even if you are browsing in a custom application on your mobile phone, there is always a way to get the link for whatever item you are viewing.

Better yet: you can use your RSS Reader to save the item in the Grouping it belongs to. This works in two ways. Besides keeping all your reading in your RSS Reader, if you have added automation to your Groupings, the item will be processed without further action on your side.

Adding an item you've stumbled upon outside your RSS Reader to the appropriate Reader Grouping.

Storage applications also let users add annotations, in addition to the item itself. Even though you may not always be willing or able to do this, if you have the time for it, it does add value. Some typically useful annotations are tags and notes.

Tags are the most relevant and characteristic keywords for an item. Adding tags to items works both as a way of quickly finding out what items are about, as well as a way of finding related items. Items with similar tags normally refer to similar topics, and tagging makes browsing across them easier.

Notes are free-form text that can be shaped to your liking. One way of using notes is to store summaries of the gist of an item's content. This does not necessarily have to be deep and thoughtful, although making it so does help. Summaries, however, can be as simple as adding your personal touch to a link you would share on social media.

Adding annotation such as tags and notes to your saved items adds value. The simplest way is to do it manually, though automation can help here, too.

Thinking about summaries this way opens up an array of options. One simple option is to save summaries for your own use, as a note to self. Another option, if you would like to share your summaries with the world, is to use them as the text of your social media posts. The hard part is writing the summary; how to store and share it is, again, a matter of automation.

Reusing your summary can be as manual as copy-pasting it from your RSS Reader or Storage application browser extension to your social media sharing application, or as savvy as using automation facilities in your Reader to make sure your annotations are picked up and stored in all the right places -- including being shared on social media.

Save for posterity, interoperate, automate: standards are your friend

This is just the beginning of what you can do with your news items. Once you start thinking about consuming and storing your news in a structured way, and leveraging standards and automation, the sky is the limit.

We already referred to some key standards: RSS, which lets you subscribe to any news source. OPML, which lets you use your subscriptions in any RSS Reader. Bookmarks, which lets you import and export your items to and from any bookmark application.

If you have mastered these techniques, applications and formats, and you are ready to explore more options, there is one more format, and some applications that can help. The format is CSV, the common denominator for working with data. The applications are back-end integration services like Zapier or IFTTT, which can help capture more data and connect applications to one another.

Many applications today make their APIs available on integration services. You can think of APIs as ways of getting notifications and accessing functionality in applications. For example, scrolling past an item in a RSS Reader triggers an API notification. Opening an item triggers another notification. Calling a Storage application's API can tell the application to store an item.
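For instance, Zapier’s “Webhooks by Zapier” feature hands you a catch-hook URL; anything POSTed to it can be routed on to a spreadsheet, a Storage application or social media. A sketch -- the hook URL and the fields are placeholders you would define yourself:

    import requests

    ZAP_HOOK = "https://hooks.zapier.com/hooks/catch/123456/abcdef/"  # placeholder URL

    item = {
        "title": "Example article",
        "url": "https://example.com/article",
        "action": "saved",            # e.g. read / saved / skipped
        "category": "Arts - Save",
    }
    requests.post(ZAP_HOOK, json=item, timeout=10)  # Zapier routes it from here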

By using integration platforms like Zapier, you can automate your workflow and save data and metadata about your actions.

Normally, it takes programming skills to use these APIs. But integration services make this functionality available to everyone. They require a fee to use (even though free tiers exist), and they still need time, and advanced application understanding, for things to work. What they offer in return, however, is remarkable: the ability to integrate disparate applications.

Being able to implement hacks such as "Store items I save in that Grouping in my RSS Reader, in this Folder in my Storage application" makes life easier. But there's more to integration than this: saving data for future use.

How would you like to be able to keep track of items you read versus items you skip, times of access, sources, or how you annotate them? That's the kind of thing social media platforms do, which is why they know so much about you. But with the right tools and some savvy, so could you.

You can tap into your RSS Reader API, and store that data in Google Sheets. From there, exporting to CSV, and having that data in a portable format is easy. What can you use the data for? To train a machine learning algorithm that can annotate the way you do, for example. Or to browse, and learn from, your news reading patterns.
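Even without an integration platform, the logging itself is trivial. A sketch of a reading log appended as CSV rows, with a made-up schema:

    import csv
    import datetime

    # Append one row per reading event. The columns are an assumption;
    # pick whatever metadata you actually want to learn from later.
    with open("reading_log.csv", "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.datetime.now().isoformat(timespec="seconds"),
            "read",                          # read / skipped / saved
            "Arts",                          # Folder or Category
            "https://example.com/article",   # item URL
            "museum, exhibition",            # tags
        ])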

Using integration services, a cloud-based spreadsheet like Google Sheets and CSV can save your data and metadata for posterity, opening up a range of possibilities.

One step at a time, in it for the long run

This may sound like a lot. Frankly, it is. But you should not let it intimidate you, as much as you should not expect to go from zero to hero in a day. This approach has been honed over years and distils research and practical knowledge across a number of domains. Take one step at a time. The approach is built in a way that allows you to do this. Some structure is better than none. Some tooling is better than none.

Progressively, you will become more familiar with the principles, and more comfortable with the tools. Of course, you can tweak to your own liking. The point is to find something that works for you. As you find your own pace and way of doing things, one last word of warning: do not get carried away. The approach and the tools will enable you to process much more information than you thought possible. You need to know where to draw the line. Information overload is a very real risk to your well-being. Sometimes, just because you can do something does not mean it’s a good idea to do it. Remember -- the point is to find something that works for you.

Inside the FinCEN Files: How ICIJ analysed damning data on big banks and dirty money https://datajournalism.com/read/longreads/inside-the-fincen-files Sun, 08 Nov 2020 03:00:00 +0100 Emilia Díaz-Struck, Agustin Armendariz, Delphine Reuter, Jelena Cosic, Karrie Kehoe, Mago Torres, Margot Williams and Miguel Fiandor Gutiérrez https://datajournalism.com/read/longreads/inside-the-fincen-files The FinCEN Files reveals the role of global banks in industrial-scale money laundering – and the bloodshed and suffering that flow in its wake.

Drawing on a cache of secret financial intelligence reports, the global investigation reveals how banks’ profit motives overwhelm their legal obligations to stop dirty money — and how a broken U.S.-led enforcement system perpetuates business as usual.

A data analysis by the International Consortium of Investigative Journalists found banks routinely processed transactions without knowing the ultimate source or destination of the money, often to and from shell companies incorporated in secrecy jurisdictions in transactions with potential links to money laundering and corruption. The analysis also found lags from the time of a suspicious transaction to banks’ filing a report.

The leaked documents, known as the FinCEN Files, include more than 2,100 suspicious activity reports, or SARs, filed by banks and other financial firms with the U.S. Department of Treasury’s Financial Crimes Enforcement Network. The agency, known in shorthand as FinCEN, is an intelligence unit at the heart of the global system to fight money laundering.

The global collaboration explored more than $2 trillion in transactions dated from 1999 to 2017 that had been flagged in the more than 2,100 reports by nearly 90 financial institutions. Most of the SARs in the FinCEN Files – 98% – were filed from 2011 to 2017. The FinCEN Files also contain transaction spreadsheets and FinCEN reports, bringing the total cache to about 2,600 documents.

The FinCEN Files represent less than 0.02% of the more than 12 million suspicious activity reports that financial institutions filed between 2011 and 2017.

According to BuzzFeed News, some of the records were gathered as part of U.S. congressional investigations into Russian interference in the 2016 U.S. presidential election; others were gathered following requests to FinCEN from law enforcement agencies. BuzzFeed News obtained the records and shared them with ICIJ, and journalists from 108 news organisations in 88 countries, to use as a basis of a 16-month investigation into money laundering and the role played by name-brand banks.

Mining the data and exploring the money flows was a project-within-a-project.

The data came with challenges

The suspicious activity reports in the FinCEN Files are a sprawling jumble of documents that reflect the private concerns of global bank money-laundering compliance officers. The SARs include a narrative along with attached spreadsheets of sometimes hundreds of lines of raw transaction data. The reports are of varying quality: some are highly detailed, describing transactions that banks say bear all the hallmarks of money laundering. Others are missing vital information, and reflect a lack of insight by banks themselves about the billions of dollars they are moving for high-risk clients and for other financial institutions. Some records are simply spreadsheets filled with party names, bank names, figures, and dates, that in the FinCEN Files came unattached to the narrative that would provide a reason for their inclusion.

For instance, in the FinCEN Files, compliance officers sometimes leave blank the space intended for the primary address. In more than a fifth of the reports, the address field for at least one flagged subject – whether an individual or a shell company – has no street number, city, or even country, which is supposed to be designated with a two-character code. In some cases, the blank addresses are for customers in the bank’s own corporate network.

And when an address was included, more than half of the FinCEN Files SARs listed the wrong country code, ICIJ found. For example, on occasion, an address in China would have Switzerland’s “CH” country code assigned to it.
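A check like that can be automated. A sketch of the idea using the pycountry package to compare a report’s two-character code against the country named in the address text -- the matching here is deliberately crude and is not ICIJ’s actual method:

    import pycountry  # pip install pycountry

    def code_matches_address(country_code, address):
        """Crude check: does the ISO country name appear in the address text?"""
        country = pycountry.countries.get(alpha_2=country_code.upper())
        return bool(country) and country.name.lower() in address.lower()

    # "CH" is Switzerland, so a Chinese address fails the check.
    print(code_matches_address("CH", "12 Nanjing Road, Shanghai, China"))  # False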

A 2018 Treasury Department Inspector General report found “inconsistencies in how filers report certain critical data fields such as institution name or address.” The report’s review of 39 critical data fields in more than 1.75 million SARs and related documents filed from May 2013 to April 2014 found one or more “data quality errors” – including omitted addresses and other critical data fields left blank – in 33.5% of the filings. The report also said there was “no mechanism in place” to ensure that errors in SARs were corrected.

In response, the agency’s management said it had made reforms that it believed “strike the proper balance of data quality with data urgency and usefulness.”

Mining the data and exploring the money flows was a project-within-a-project. ICIJ coordinated a massive global effort involving more than 85 journalists in 30 countries to extract data from the PDF files that contained the SAR narrative reports, as well as to gather more than 17,600 additional records, many via freedom of information requests.

ICIJ shared the records with partners on its bespoke sharing and research platform, Datashare, which is developed by ICIJ’s technical team.

ICIJ and its partners analysed the data using statistical and textual analysis. ICIJ also built a bespoke fact-checking tool to process the extracted data and deployed machine learning to review more than 60,000 addresses that were part of the data. All addresses were later checked manually.

Most of the SAR narratives in the FinCEN Files cache didn’t include attached spreadsheets containing transaction-level data. But since the narratives often contained key details about money flows, ICIJ, BuzzFeed News and media partners explored the reports’ roughly 3 million words as part of the analysis.

Here are the findings.

At least 20% of the reports contained a client with an address in one of the world’s top offshore financial havens, the British Virgin Islands.

Searching both numbers and text

ICIJ’s analysis found that in half of the reports, banks didn’t have information about one or more entities behind the transactions. In more than 680 reports in the FinCEN Files, financial institutions asked other banks for more information about entities, and on more than 160 occasions the other banks didn’t respond. Some banks or branches in countries such as Switzerland cited local secrecy laws to deny the information.

An ICIJ analysis also found that banks in the FinCEN Files regularly processed transactions for companies registered in so-called secrecy jurisdictions and did so without knowing the ultimate owner of the account. In more than 620 of the reports, banks flagged the use of “high risk” jurisdictions at least once. Corporate account holders often provided addresses in the U.K., the U.S., Cyprus, Hong Kong, the United Arab Emirates, Russia and Switzerland. At least 20% of the reports contained a client with an address in one of the world’s top offshore financial havens, the British Virgin Islands.

Deutsche Bank’s 982 filings represented 62% of the total amount in suspicious transactions in the leak. The FinCEN Files also contain large numbers of files from Bank of New York Mellon, Standard Chartered, JP Morgan Chase, Barclays and HSBC.

This data is not representative of all SARs received by the U.S. Department of Treasury's Financial Crimes Enforcement Network. The 1,943 SARs in this data cover transactions between 1999 and 2017.

ICIJ’s analysis revealed a median time lag of 166 days – almost half a year – between the time transactions took place and the time they were reported to FinCEN. Federal rules require financial institutions to report a suspicious transaction in most cases within 30 days of detecting it.
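The lag statistic itself is a simple computation once transaction and filing dates have been extracted. A toy illustration with invented dates:

    from datetime import date
    from statistics import median

    # (transaction date, SAR filing date) pairs -- the dates are made up.
    pairs = [
        (date(2015, 1, 10), date(2015, 7, 1)),
        (date(2016, 3, 2), date(2016, 4, 15)),
        (date(2014, 6, 20), date(2015, 2, 1)),
    ]
    lags = [(filed - tx).days for tx, filed in pairs]
    print(median(lags))  # median lag in days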

The analysis found some cases in which banks filed reports in response to news coverage (including ICIJ’s 2016 Panama Papers investigation) or legal filings involving customers, long after the transactions took place.

ICIJ also found suspicious transactions tied to more than 20 companies and individuals flagged by the banks that were linked to corruption, fraud, embezzlement or sanctions evasion cases (and produced an interactive to present key details about these clients).

The analysis found that suspicion of money laundering operations was the most common reason given for filing a report in the FinCEN Files. Other reasons were suspicion of fraud, a FinCEN category called “financial instruments (monetary contracts),” and suspicion of so-called structuring, a series of transactions designed to avoid red flags.

ICIJ reviewed each extraction three times. The fact-checking alone took seven months.

A global effort to mine the data

After removing duplicates, standardising bank names and other preliminary steps, ICIJ performed textual analysis to identify sentences in the narratives that could indicate the presence of a shell company or that a bank didn’t know the ultimate owner. ICIJ used the SQL and Python programming languages for the analysis.
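As a rough illustration of what such textual analysis can look like in Python -- the phrases below are guesses at the kind of language compliance officers use, not ICIJ’s actual patterns:

    import re

    PATTERNS = [
        r"\bshell compan(y|ies)\b",
        r"\bbeneficial owner\w* (is|was|remains) unknown\b",
        r"\bunable to (identify|verify) the (ultimate )?owner\b",
    ]

    def flag_sentences(narrative):
        """Return narrative sentences matching any suspicious-language pattern."""
        sentences = re.split(r"(?<=[.!?])\s+", narrative)
        return [s for s in sentences
                if any(re.search(p, s, re.IGNORECASE) for p in PATTERNS)]

    example = ("The account holder is a shell company. "
               "The beneficial owner is unknown. Funds moved onward.")
    print(flag_sentences(example))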

ICIJ, BuzzFeed News and its partners tried more than one form of programming in an attempt to extract details from the more than 8,000 pages of narratives automatically (much was still done by hand). At first, ICIJ partner SVT used machine learning to screen the records and obtain a first set of transactional data. The variations in language and the complexity of the reports prevented the capture of some key details.

In the end, ICIJ and its partners launched a giant data-extraction effort: for more than a year, 85 journalists in 30 countries reviewed and extracted transaction information from assigned suspicious activity reports and manually entered it into Excel files, which were then uploaded to ICIJ’s communications platform, the Global iHub. The effort resulted in 55,000 records of structured data and included details on more than 200,000 transactions flagged by the banks in the SARs.

After the extraction was complete, ICIJ reviewed each extraction three times. The fact-checking alone took seven months. Using the Django web framework, ICIJ built its own fact-checking tool that highlighted the information extracted by each reporter, allowing colleagues to flag errors and track edits throughout the process.

ICIJ held extensive training sessions for partners on the use of ICIJ’s technologies for research, and follow-up review sessions to better understand the data. The project also made extensive use of ICIJ’s Global iHub and secure conference calls to coordinate the complex undertaking.

Through this massive effort, ICIJ was able to find details that would have otherwise remained hidden on more than $380.6 billion in the FinCEN Files, including, for instance, more than $9.3 billion in reported suspicious transactions involving the gold trading company Kaloti. More than a fourth of the total amount of suspicious transactions reviewed as part of the FinCEN Files investigation were related to gold.

This giant extraction effort helped track correspondent banks – global banks with access to the U.S. Federal Reserve that process transactions on behalf of financial institution customers around the world. The analysis found that jurisdictions such as Latvia and Hong Kong were among the most common locations of local banks receiving or sending money via correspondent banks in the FinCEN Files.

Public records

ICIJ also found huge discrepancies between the amounts that so-called limited liability partnerships had filed in UK government financial statements and the amounts bank compliance officers reported flowing through the same companies’ accounts. More than $4.5 billion more appeared in the FinCEN Files as flowing through LLP accounts than the LLPs reported as revenue in their financial statements to Companies House, the registrar of companies in the United Kingdom long criticised for allowing corporations to register with secret owners.

ICIJ also used information from the Venezuelan Registry of Contractors and public records databases Sayari and Vendata to identify in the FinCEN Files more than $4.8 billion in reported suspicious transactions with links to Venezuela between 2009 and 2017. Nearly 70% of that amount listed a Venezuelan government entity, such as the Ministry of Finance, as a party to the transaction.

Connecting the dots

Finally, ICIJ used the Neo4j graph database and the Linkurious graph visualisation platform to visualise and explore the FinCEN Files’ 400 spreadsheets containing data on 100,000 transactions. These were among the many tools that helped piece together a nuanced picture of a broken system.
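For a sense of what that looks like in practice, here is a hedged sketch using the official neo4j Python driver (version 5) to load transactions as edges between entities -- the schema, names and credentials are illustrative, not ICIJ’s actual model:

    from neo4j import GraphDatabase  # pip install neo4j

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))  # placeholder credentials

    def add_transaction(tx, sender, receiver, amount_usd):
        # MERGE avoids duplicate entity nodes; each transaction adds an edge.
        tx.run(
            "MERGE (a:Entity {name: $sender}) "
            "MERGE (b:Entity {name: $receiver}) "
            "CREATE (a)-[:TRANSACTED {amount_usd: $amount_usd}]->(b)",
            sender=sender, receiver=receiver, amount_usd=amount_usd,
        )

    with driver.session() as session:
        session.execute_write(add_transaction, "Shell Co Ltd", "Example Bank", 1_000_000)

Once the data is loaded, questions such as which entities sit between the most money flows become short Cypher queries, and tools like Linkurious render the results visually.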


Contributors: Miriam Pensack, Pierre Romera, Jeremy Singer-Vine and John Templon

This article was adapted from a story originally posted on ICIJ. It was edited and republished on DataJournalism.com with permission.

A journalist’s guide to US opinion polls https://datajournalism.com/read/longreads/a-journalists-guide-us-polls Thu, 29 Oct 2020 11:00:00 +0100 Sherry Ricchiardi https://datajournalism.com/read/longreads/a-journalists-guide-us-polls Donald Trump’s surprise victory in the U.S. presidential election four years ago was a catastrophic blow to the reliability of opinion polls. As November 3 approaches, journalists – and the public – ponder a crucial question: “Can the polls be trusted this time?”

A Washington Post headline pinpointed why so many voters are feeling a sense of dread: “Biden leads Trump. So did Hillary Clinton. For Democrats, it’s a worrisome campaign déjà vu.”

While national polling shows Biden ahead, “It is but a snapshot, with a built-in margin of error that can go either way or not at all. Voting may have begun, but...voters have changed their minds before,” The Washington Post reported on October 18.

Four years ago, pollsters struck out on several fronts. Among the most notable glitches: polling and interviewing were shut down a few days before the election, missing on-the-fence voters who swung heavily for Donald Trump. There was higher voter turnout in many rural counties, likely Trump territory, and lower turnout in urban hubs favourable to Hillary Clinton.

Don’t assume the previous election is going to be the model for what’s happening now.

The 2020 election has its own troubling factors.

“Getting accurate poll data this year is being complicated by the pandemic, widespread mail-in voting, hyper-polarised constituencies and daily news surprises. Journalists shouldn't make matters worse by superficial and careless reporting,” wrote former Los Angeles Times editor Frank O. Sotomayor in an article, “Reporting on polls? Here’s how to do it responsibly.”

Improving the quality of coverage isn’t rocket science if you know a few essentials, said Sotomayor. “Too many reports, for example, ignore that each poll carries a margin of error—or fail to explain what that means. Adding fine print at the bottom of a graphic doesn’t cut it.”
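The margin of error is worth understanding rather than footnoting. For a simple random sample it follows from the textbook formula sketched below; real polls apply weighting and design corrections on top of this, so treat the figure as a lower bound:

    from math import sqrt

    def margin_of_error(n, p=0.5, z=1.96):
        """95% margin of error for a simple random sample of size n."""
        return z * sqrt(p * (1 - p) / n)

    # A 1,000-person poll carries roughly a +/-3.1 point margin of error,
    # so a 49%-47% "lead" is inside the noise.
    print(round(100 * margin_of_error(1000), 1))  # 3.1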

A Poynter article gives journalists some useful advice on how to navigate the 2020 presidential election cycle.

Here is the good news. Boning up on polling fundamentals and concepts is easier than ever using online tools. There are webinars and videos on polling, online courses and guidelines from veteran journalists. Most are free and accessible.

For journalists, reporting on polls is serious business, a time-honoured responsibility that gives voice to ordinary citizens about their leaders and the social, political and economic issues that impact their lives. Reliable polls are regarded as the most accurate way to measure public opinion, but that comes with a caveat.

“Not all polls are created equal, and it’s a challenge for reporters to put polling results in proper perspective,” cautions Louis Jacobson, senior correspondent for PolitiFact, a Pulitzer Prize-winning fact-checking website. “Don’t assume the previous election is going to be the model for what’s happening now.”

Huge sample sizes sound impressive, but sometimes they don’t mean much.

Building confidence in polling

Courtney Kennedy, director of survey research at Pew Research Center, has a ringside seat for opinion polling in the United States. She is optimistic about this election cycle. There are more polls in key states, and some are of higher quality.

In several battleground states like Michigan, there are twice as many polls as in 2016, or more, and many are addressing the technical issues that went wrong last time, said Kennedy -- but polls are not infallible. Overall, national polls have performed well, she says, while errors in state-level polls in 2016 were “large and problematic.”

In an article for Pew Research Center, Kennedy stressed the value of polling to a democracy. “A robust public polling industry is a marker of a free society. It’s a testament to the ability of organisations outside the government to gather and publish information about the well-being of the public and citizens’ views on major issues,” she wrote.

An article by Courtney Kennedy published on Pew Research Center's website explains why polling a society matters to democracy.

There also was a list of what the public should know about polling heading into the 2020 presidential election. Here is a sampling:

  • Different polling organisations conduct their surveys in quite different ways. “Currently, CNN and Fox News conduct polls by telephone using live interviewers, CBS and Politico field their polls online using opt-in panels, and The Associated Press and Pew Research Center conduct polls online using a panel of respondents recruited offline. These different approaches have consequences for data quality, as well as accuracy in elections.”
  • Huge sample sizes sound impressive, but sometimes they don’t mean much. “Students learning about surveys are generally taught that a very large sample size is a sign of quality because it means that the results are more precise. While the principle remains true in theory, the reality of modern polling is different. As Nate Cohn of the New York Times explained, ‘Often, the polls with huge samples are actually just using cheap and problematic methods.’”
  • The barriers to entry in the polling field have disappeared. “Technology has disrupted polling in ways similar to its impact on journalism: by making it possible for anyone with a few thousand dollars to enter the field and conduct a national poll...there has been a proliferation of polls from firms with little to no survey credential or track record.”

Pew Research Center's website offers a field guide to polling and instructive videos under the heading, “Methods 101,” exploring topics such as random sampling, wording for survey questions, and how polling is conducted around the world.

PolitiFact’s Louis Jacobson also paints a more positive scenario for opinion polls this year. In 2016, “Polls were all over the map, zigzagging from start to finish. They appear much more stable this time,” said the former deputy editor for Roll Call, a Washington, D.C., newspaper that covers Capitol Hill.

“People know a lot about Trump. They have had their minds made up for or against him for a long time. The choice [between Biden and Trump] appears to be more fixed,” said Jacobson. He cites another factor: Americans are voting early by mail or in-person in record numbers.

As of October 28, more than 75 million votes had been cast around the US, surpassing the 58.3 million total pre-election votes in 2016. That is more than 54 percent of the overall turnout for the 2016 election, according to The New York Times. Data science consultant Rob Arthur believes the transformation in how people are voting could be a snag for polls.

“Pollsters are doing their best to cope with this, but it’s really hard to know how all these different factors – coronavirus, new ways of voting, the constant stream of news – will impact an election,” said Arthur. “Even the best pollsters may run into problems they haven’t anticipated.”

PolitiFact has posted a primer on what voters should pay attention to when reading polls, including weighting for education, poll wording, cherry-picking poll results, and cell phone/internet polling. “We are entering a new golden age for polling if the tools are used properly,” Anthony Salvanto, CBS News elections and surveys director, told PolitiFact.

Writing about polling data

The London-based Market Research Society (MRS) is a hub of quality control for opinion polling with more than 5,000 members in 50 countries, according to the website. “We have an ongoing commitment to supporting members’ professional development and providing them with tools they need to uphold the highest standards,” said CEO Jane Frost.

She points to resources available free on the MRS web page created for journalists and the media. The goal, she says, is to train the trainers, especially journalism educators because, “That’s the best way, to catch them early. We want these materials to be disseminated and used.”

A training module on the MRS website, “Interpreting polls and election data – guidance for media and journalists,” provides an overview on polling that is adaptable to workshops, seminars and classroom instruction. Newsrooms might also use it for professional development.

MRS partnered with IMPRESS, the UK’s independent press regulator, to create “Using surveys and polling data in your journalism,” a training tool on how to spot unreliable research and ensure that stories on polls are statistically sound. There is a focus on common mistakes reporters make, such as a lack of understanding of statistics – for instance, not knowing the difference between the UK and Great Britain for statistical purposes.

In November 2019, Market Research Society and IMPRESS co-wrote "Using surveys and polling data in your journalism."

MRS also works closely with the British Polling Council (BPC), another opinion poll watchdog. The BPC’s quick guide for reporting on polls, posted on its website, begins with a hypothetical:

“The results of a poll have just landed on your desk. You have to write a report about it in a matter of hours. But, can you trust it? What should you be looking out for? And what details should you include?” It is a quick guide to what you need to know and do – in just five minutes.

It explains how polls are conducted, which ones should be avoided, what can go wrong and the limits of polling, and ends with a checklist of five questions reporters should answer before writing their story.

Another resource worth noting: the Quinnipiac University Poll, a household name in the U.S. polling industry, offers a tip sheet called “The importance of covering poll data with clarity and accuracy.” It exhorts reporters to answer who, what, when, why and how before writing about a poll.

Quinnipiac states on its website: “Before journalists can report the statistical findings of a poll, it is essential they understand the methodology involved in the development of the poll and the data collection. The first thing to determine is whether the poll is transparent about its methods. There are many factors that go into creating a poll, and each one can have a big impact on the results.”

The bottom line: The more journalists know about polls, how they work and how to evaluate their quality, the closer they come to clarity and accuracy in reporting. That is vitally important during an election campaign where lines between truth and disinformation often have been blurred.

Prepping for election night

Political pundits predict election night may be rife with chaos. A Poynter webinar brought together media experts to ruminate about the potential problems that could arise.

Counting ballots will take longer with a record number of mail-in votes. States have divergent rules on when they can start counting ballots. Social media and extreme partisanship will stoke misinformation. The winner might not be known for days or weeks, depending on the outcome and any legal entanglements that might follow.

Those are only some of the issues that could beleaguer Election Night 2020, said newsroom leaders from National Public Radio, The New York Times, CNN’s Washington bureau, and the Associated Press, among others.

After the webinar, PolitiFact’s Jacobson co-authored an article listing recommendations for journalists assigned to cover the election. They included:

  • Figure out a coronavirus safety plan for your newsroom. “All the preparations in the world won’t matter if you’re too sick to work. Reporters will be out on the street talking to voters, and at polling places and election supervisor headquarters. Provide relevant equipment, such as masks, to keep your staff safe.”
  • Counter misinformation when you see it. “Debunk or clarify the claim as early as possible in the story, and make sure not to give a misleading impression with headlines and in social media. Online tools such as Crowdtangle can help you find out what’s gaining traction online and reverse-image search RevEye can help verify and track down viral images. Google’s Factcheck Explorer can lead you to fact-checking work of other journalists.”
  • It’s OK to say you don’t know the answer yet. “When there haven’t been enough votes counted to be sure who’s ahead, or when the margins are too close to make a call, don’t rush. It’s more accurate and responsible to say there’s ‘no clear leader’ than to focus on who is leading with a small margin.”
  • Emphasise the local view. “The big advantage that local media outlets have over national ones is that they already have journalists on the ground. If a dispute arises in a specific state, reporters in local newsrooms are often best positioned to sort out the facts.”

Jacobson’s article also stated a reality: whether it’s orderly, or whether it proves to be the weirdest election night ever, it is certain to be historic. That appears to be a foregone conclusion. Foreign Policy magazine has dubbed the Biden-Trump race “The Most Important Election. Ever.”

Resources that can help:

  • Reporters Committee for Freedom of the Press: “Election Legal Guide” with an overview of legal issues journalists may face. Topics include exit polling, newsgathering in or near polling places and access to ballots and election records.
  • Journalist's Resource: Explainer on polls and an article “11 questions journalists should ask about public opinion polls.”
  • FiveThirtyEight: Focuses on statistical analysis and has updated its pollster rankings.
  • Roper Center iPoll: The largest archive of U.S. public opinion data, covering the last 65 years. Use iPoll for individual survey questions and RoperExpress for entire datasets. First-time users must register, but it's free.


Challenging election disinformation with data https://datajournalism.com/read/longreads/challenging-election-disinformation-with-data Mon, 28 Sep 2020 12:00:00 +0200 Sherry Ricchiardi https://datajournalism.com/read/longreads/challenging-election-disinformation-with-data Harvard University researcher Brian Friedberg operates like a detective, snooping in the dark recesses of the Internet. His main target is the far-right conspiracy theory QAnon, a shadowy movement described in a recent CNN report as “dangerous and growing.”

Q has surfaced as a factor in congressional races and news coverage in the 2020 presidential election. “We are Q” posters and T-shirts have appeared at President Donald Trump’s rallies, sending up red flags.

Operatives function as provocateurs intent on luring followers away from mainstream media. Even conservative Fox News has been in their crosshairs. “They encourage distrust of any news sites outside of their frame,” said Friedberg, a senior researcher with the Technology and Social Change Project at the Harvard Kennedy School.

He has followed the movement since it appeared on the anonymous imageboard 4chan in October 2017, and has watched it grow into a global phenomenon.

Political parties attempting to manipulate news coverage is nothing new, but Friedberg’s research shows QAnon has expanded its base and pushed disinformation to an extreme.

In August, The Guardian reported that QAnon’s media ecosystem includes “enormous amounts of video content, memes, e-books, chatrooms, and more, all designed to snare the interest of potential recruits, then draw them `down the rabbit hole’ and into QAnon’s alternate reality.”

Twitter, Facebook and other social networks are flooded with QAnon-related false information about the 2020 election, COVID-19, and Black Lives Matter protests. Facebook has removed or restricted thousands of QAnon and militia groups and their accounts promoting conspiracy theories and hate speech.

QAnon has drawn attention in high places. Asked about the movement during a press briefing, President Trump claimed its followers “are people that love our country” and “I understand they like me very much, which I appreciate.” The FBI has designated Q a domestic terrorist threat.

For journalists, it is a Catch-22. Ignoring these groups is not an option. The public has a right to know who they are and the threats they pose. But, amplification also is an issue. Does press coverage unwittingly provide oxygen to extremist movements? Experts like Friedberg, an investigative ethnographer specialising in anonymous communities, provide a window into how these groups operate. Social media platforms provide an entry.

Inside Q, Friedberg has seen detailed instructions from organisers on how to create fake accounts, what kind of memes to spread, which influencers to target on social media, and how to get journalists’ attention. “My job is to get deep into these spaces, get to know them, get to know their norms and what’s important to them,” he said.

A lot of the online world right now is designed to grab media attention. Journalists are the main target.

Media manipulation in practice

In February, Wired published an article, “QAnon Deploys `Information Warfare’ to Influence the 2020 Election,” describing how the movement planned to “flood social media with pro-Trump, pro-Republican, and anti-Democratic narratives or, failing that, to simply hijack and derail conversations.”

A screenshot that accompanied the story declared “memewar on Democrats,” labelling them “traitors.” Wired reported Trump retweeted QAnon accounts at least 72 times, including 20 times in one day in December 2019.

Screenshot of post on 8kun, captured December 16, 2019. Courtesy of Elise Thomas, Wired.

Media professionals are on the frontlines of this global information war that defies basic standards of truth, fairness and impartiality.

“A lot of the online world right now is designed to grab media attention. Journalists are the main target,” said Joan Donovan, research director of the Shorenstein Center on Media, Politics and Public Policy at the Harvard Kennedy School. Among her areas of expertise: online extremism, media manipulation, and disinformation campaigns.

Donovan and Friedberg co-authored a report for the Data & Society Research Institute on “source hacking,” a technique manipulators use to get reporters to pick up falsehoods and unknowingly amplify them to the public.

The study contains an important message for journalists: “Learning the tactics of source hacking is a starting point for understanding manipulation campaigns and for designing platforms that can defend against them.”

Data & Society's new report explains the difficult balance journalists face in the fight against disinformation.

The report breaks down the tactics used by trolls into four categories:

  1. Viral Sloganeering: “Repackaging reactionary talking points for social media and press amplification”
  2. Leak Forgery: “Prompting a media spectacle by sharing forged documents”
  3. Evidence Collages: “Compiling information from multiple sources into a single, sharable document, usually as an image”
  4. Keyword Squatting: “Strategic domination of keywords and sock puppet accounts to misrepresent groups or individuals”

Friedberg provided examples for each category, highlighting how these groups hijack media attention.

Viral sloganeering: The Q movement strongly identifies with several slogans popularised through hashtags, memes, videos, posters and online conversations. Among the most common is “Calm before the storm,” used to refer to the supposed upcoming arrests and indictments of President Trump’s political enemies.

“Where we go one, we go all,” also known as WWG1WGA, is a community-building phrase and a call to action for participating in meme campaigns and crowd-sourced Q posts. “The Great Awakening” is a promise to followers of the salvation to come.

Forged leaks can be deployed across social media, with manipulators attempting to drum up enough activity to trigger further news coverage.

An explanation from the report: “Because these forms are easily transmitted and copied, they can quickly spread to public forums, both online and offline, and thus become removed from the group that created them. If manipulators are able to hide the source of the slogan and create a sufficient social media circulation, mainstream media sources may provide even further amplification.”

Case in point: in October 2018, anonymous social media users posted #JobsNotMobs, an attack on immigration in the run-up to the U.S. midterm elections. The slogan moved from the fringes of the right-wing internet to the top, drawing a tweet from President Trump. His campaign distributed “Jobs vs. Mobs” signs at rallies.

To the originators, it was like winning the Super Bowl. “Any mention in the press is a victory. Even hit pieces by journalists are shared in these communities as trophies, proof they are being noticed,” said Friedberg.

Leak forgery: On Aug. 10, The Daily Beast reported that a “flight log” of Jeffrey Epstein’s “Lolita Express” private jet containing dozens of Hollywood A-listers had gone viral among QAnon followers even though it was “laughably fake.”

The doctored flight list featured dozens of celebrities, including Barack Obama, Chrissy Teigen and Beyoncé, with no known ties to Epstein. A cross-reference of the screenshot list with the flight logs released in the court record found it named 36 celebrities who never set foot on the plane. The powerful and elite are prime targets for forgers. So are election campaigns.

In December 2017, Republican congressional candidate Omar Navarro released a phoney document via Twitter targeting his opponent, Congresswoman Maxine Waters. It appeared to be a plan for Waters to accept a donation from a bank in exchange for allowing 41,000 immigrants to move into her Southern California district. Despite being discredited, months later the tweet containing the forgery had nearly 15,000 retweets, 12,000 likes, and remained online.

From the report: “Like viral slogans, forged leaks can be deployed across social media, with manipulators attempting to drum up enough activity to trigger further news coverage. By staging conversations about the forged leak through alternative news outlets and social media, manipulators draw in mainstream news coverage before any entity can debunk the documents.”

Evidence collages: Friedberg directs journalists to a QAnon website for examples. It contains more than 200 “Qproofs,” billed as “a collection of QAnon evidence produced by anonymous patriots.” A section labelled “Breadcrumbs” has archived Q graphics dating back to October 2017.

“These simple graphics are a huge, huge part of how ‘knowledge’ is produced in the Q community,” Friedberg said.

He points to the Aug. 12, 2017, Unite the Right rally in Charlottesville, when a white supremacist drove his car into a group of counter-protesters, killing one of them and fleeing the scene. Using open source investigation techniques, 4chan manipulators quickly constructed a convincing evidence collage, falsely identifying the driver of the car as a leftist student. The collage was amplified on far-right news platforms and social media.

From the report: “Manipulators use these carefully constructed ‘infographics’ to sway breaking reporting and encourage further investigation by citizens. Evidence collages often contain a mix of verified and unverified information and can be created with simple image-editing software.”

Keyword squatting: In July, Friedberg wrote an article for Wired about the “blood-harvesting conspiracy,” a plot targeting celebrities and based on a rumour that “global elites” were harvesting the chemical adrenochrome from the blood of children and injecting it to stay healthy and young. He traced a surge of online interest in March to the Covid-19 pandemic.

“Celebrities posting photos of themselves stuck at home and looking less than camera-ready were besieged on social media with accusations that they were suffering from adrenochrome withdrawal. In their logic, [Covid-19] shutdowns had stalled the adrenochrome child-trafficking supply chain,” Friedberg wrote in Wired.

Conspiracists spread the adrenochrome hashtag to new users while, at the same time, harassing their targets. For them, it was a victory.

The report explained: Keyword squatting is the “technique of creating social media accounts or content associated with specific terms to capture and control future search traffic... Squatting can also support forms of online impersonation where manipulators use misleading account names, URLs, or keywords to speak as their opponents or targets.”

Friedberg’s advice for reporters: “Don’t give credence to Q’s outlandish claims. We are past the point of having to repeat them in every article. Don’t reproduce or hyperlink to their materials. Photos of Q signs at rallies are not viewed as embarrassing by the Q community. They are viewed as a visual sign of their strength.”

The Oxygen of Amplification draws on in-depth interviews by scholar Whitney Phillips to showcase how news media was hijacked from 2016 to 2018 to amplify the messages of hate groups.

Journalism on the frontlines

For a project called “The Oxygen of Amplification,” media literacy expert Whitney Phillips explored the fine line journalists walk when covering groups with extremist views.

“When we cover Trump rallies, where do we train our microphones or cameras? Who do we interview? Who gets our attention? Journalists tend to focus on the loudest, the most reactionary,” said Phillips in an interview. “When we amplify those voices at the expense of other voices, it sends the wrong message. It muddies the waters.”

In her study, she offered a stark portrayal of the dilemma media face: “The takeaway for establishment journalists is stark and starkly distressing: Just by showing up and doing their jobs, journalists covering the far-right fringe – which subsumed everything from professional conspiracy theorists to pro-Trump social media shit-posters to actual Nazis – played directly into these groups’ public relations interests. In the process, this coverage added not just oxygen but rocket fuel to an already-smouldering fire,” Phillips wrote.

Many of the 50 journalists she interviewed acknowledged their work provided publicity and may have energised manipulators. How can media avoid becoming a bullhorn for extremist groups?

Phillips, a professor of communications at Syracuse University, advises against framing “bad actors” as the centre of narratives, reinforcing that their behaviour warrants news coverage. Among questions she says reporters should consider:

• Does the story reach beyond the interests of a specific online community to the point where it is being shared and discussed more widely?

• Is there a larger positive social benefit, such as adding to an existing conversation about solutions to a problem, or sparking a new conversation about an important topic?

• Will the story cause harm to those involved, including embarrassment, re-traumatisation or professional damage?

First Draft's mission is to protect communities from harmful misinformation. Above are examples of some free resources to help journalists outsmart false and misleading information.

There are resources journalists can turn to. Aimee Rinehart, U.S. deputy director of First Draft, an organisation that fights disinformation, first heard of QAnon in May 2018. She calls the media’s “fringing” of the group “problematic.” Instead, she urges journalists to focus their stories on its most destructive and dangerous beliefs, such as homophobia, anti-Semitism, and Islamophobia.

“[Q followers] are not merely Pizzagate peddlers who see Trump as the saviour. Their foundational beliefs are deeply troubling and toxic. That’s what media should be looking at,” said Rinehart. For those unfamiliar, Pizzagate is a debunked conspiracy theory that went viral during the 2016 presidential campaign.

Claire Wardle, First Draft’s co-founder and U.S. director, posted “10 questions to ask before covering misinformation,” a guide to decision-making. Here is a sampling:

Who is my audience? “Are they likely to have seen a particular piece of mis- or dis-information already? If not, what are the consequences of bringing it to the attention of a wider audience?”

How much traffic should a piece have before we address it? “What is the ‘tipping point,’ and how do we measure it? On Twitter, for example, do we check whether a hashtag made it to a country’s top 10 trending topics?”

How should we write about attempts at manufactured amplification? “Should we focus on debunking the messages of automated campaigns (fact-checking), or do we focus on the actors behind them (source-checking)? Or do both?”

In an article she wrote for First Draft, Wardle warned against giving disinformation “extra oxygen.”

“Efforts to undermine and explain deliberate falsehoods can be extremely valuable and are almost always in the public interest, but they must be handled with care. All journalists and their editors should understand the risk of legitimising a rumour and spreading it further... especially in newsrooms developing misinformation as a ‘beat’ in its own right,” wrote Wardle.

There are models on how to fight disinformation. Jane Elizabeth, the managing editor of The News and Observer in Raleigh, North Carolina, has made fact-checking a centrepiece of the newspaper’s operation.

“For journalists, it really is a conundrum,” said Elizabeth, who took the job two years ago after a stint with the American Press Institute. “Some politicians understand fact-checking better now [than in 2016] and they’re looking for ways to get around it. They’ve gotten smarter.”

Transparency, said Elizabeth, is a vital part of the process. The News and Observer posts its fact-checking guidelines, ethics code and a sample story that illustrates the rigours of fact-checking on its website. The paper provides readers with a list of resources that were consulted in writing a particular fact check, along with the names of the reporters and editors who worked on it.

Earlier this summer, when Donald Trump told a North Carolina audience they should vote twice to make sure their vote by mail counted, the News and Observer was careful not to keep repeating his incorrect statements in its coverage. Instead, it created a Q&A on how to vote and used a pullout quote: “No, you can’t vote twice.” “We keep updating the site. It has become very popular,” the managing editor said.

As this report indicates, journalism faces an uphill climb as online communities built on deception, amplified through bots, trolls and cyborgs, proliferate and pollute the information ecosystem. Researchers like Donovan and Friedberg fight back with tools of their own: methodical, truthful analysis and warnings of what is to come.

“Disinformation has become part of our contemporary social fabric and it’s not going to go away easily,” said Friedberg. “We have to continue to do this work if we are going to get through it. I don’t have to worry about wasting my time.”

Resources that can help:

The International Fact-Checking Network: Monitors trends, formats and anti-misinformation actions around the world. Publishes regular articles and a weekly newsletter.

Verification Handbook, a definitive guide to equipping journalists with the knowledge to investigate social media accounts, bots, private messaging apps, information operations, deep fakes, as well as other forms of disinformation and media manipulation.

First Draft: Coalition of news organisations that offer free verification resources, tutorials and training materials, including a free two-week online course “Protection from Deception” on the U.S. election.

NBC News: Reporter Ben Collins’ guide to QAnon serves as a well-researched primer on the movement.

Google reverse image search: Verification of pictures to find where else a photograph has been used, and when it was used. Excellent tool for spotting faked or altered photos.

Bellingcat: An investigative journalism website that specialises in fact-checking and open-source intelligence.

Capturing racial justice protests with data https://datajournalism.com/read/longreads/how-data-captured-americas-protests Wed, 29 Jul 2020 11:36:00 +0200 Sherry Ricchiardi https://datajournalism.com/read/longreads/how-data-captured-americas-protests Like millions of Americans, Alex Smith watched protesters take to the streets after the brutal killing of George Floyd, a Black man, while in Minneapolis police custody. As he skimmed social media in search of information, a Twitter thread caught his eye.

A BuzzFeed journalist tweeted that protests against police brutality reached beyond major hubs like Chicago and New York City into America’s smallest towns and rural areas. Smith, a data analyst from Tucson, Arizona, was stunned.

“I remember thinking, ‘Wow, this movement is much greater than I realised. Something big is going on.’ I wanted to create a document as fast as I could to capture the magnitude,” said Smith, who fell in love with maps as a child. He set out to find every city and town where Black Lives Matter (BLM) or George Floyd-related protests had taken place since the May 25 tragedy.

He scraped the first 450 protest locations from maps created by NBC News, Al-Jazeera and other media outlets. The first individual point he added was Pelham, New York, population 12,470, found on a Twitter thread. He turned to Google searches, Wikipedia, and Reddit to track locations. Once the map got traction, he started getting tips via email.

By July 26, Smith’s map contained 4,352 communities worldwide where protests, marches, vigils, and demonstrations had taken place, some in far-flung places like Karachi, Pakistan; Abuja, Nigeria; and Binnish, a city in war-torn Syria. Moscow, Hong Kong and Beirut were in the mix.

Alex Smith is a geographic information system analyst in Tucson, Arizona. His goal is to map every city or town that held a George Floyd/Black Lives Matter protest, action, or vigil using Esri's ArcGIS Online software.

Smith provided a visual portrait of the outcry for racial justice spreading to white, small-town America and into the politically conservative strongholds of Mississippi, Oklahoma and Wyoming. Protesters carried “Black Lives Matter” signs in Pulaski, Tennessee, population 7,652, known as the birthplace of the Ku Klux Klan.

“Seeing the flood of dots on the map is a powerful image. It counters those who say that protests are limited to blue/Democrat urban areas, or that they are driven by Antifa and somehow dangerous or suspect,” said the former lawyer turned technology whiz. “There are people everywhere supporting Black Lives Matter, and it’s astonishing and inspiring to map.”

Verification comes through links to news articles, mentions on social media, and Twitter posts tracked by loyal volunteers.

On July 11, citing Smith as a source, the Washington Post reported that changes in small communities were fueling a racial justice movement across the Midwest. The New York Times and USA Today have used his research; stories about the project have appeared in the Bloomberg Newsletter and on National Public Radio’s Here & Now. Smith has a master’s degree in geographical information systems and technology.

In late July, he still was adding locations. “This is a project I know will never be 100 percent complete. It is perfect to do while I’m self-quarantined,” said Smith, who uses Esri’s ArcGIS online mapping software.

A female demonstrator offers a flower to a soldier at an anti-war protest at the U.S. Pentagon in 1967. Courtesy of Wikicommons/ Photo by S.Sgt. Albert R. Simpson. Department of Defense. Department of the Army. Office of the Deputy Chief of Staff for Operations. U.S. Army Audiovisual.

Media coverage of protests: Then and now

Google, Twitter and YouTube weren’t around when anti-Vietnam war protesters chanted “Hell no, we won’t go” in the late 1960s and early 1970s. These activists depended on mainstream television and newspapers to get their message out. That put them in a precarious position, said media scholar Douglas M. McLeod.

“They had a hard time getting media attention unless they did something dramatic like engage in civil disobedience, which often brought a response from the police. Ultimately, that kind of coverage portrayed the protesters as deviants and had detrimental effects on public reaction,” explained McLeod, who has studied media coverage of social protests for almost 40 years.

The 21st century has seen profound changes in the media landscape. Mobile phones have turned citizens into public watchdogs, capturing events on video and posting them on YouTube. Activists use the Internet and social media as megaphones, bypassing the Fourth Estate. A more diverse media environment gives voice to the voiceless on social issues like racism and police brutality.

During research for this report, the term “protest paradigm” repeatedly showed up in studies about media coverage of social unrest. Coined decades ago by McLeod and fellow scholar James Hertog, it referred to newsgathering patterns that tended to disparage protesters, obscure their role in the political arena, and give greater voice to the authorities.

In June, the term resurfaced in a NiemanLab report titled, “It’s time to change the way the media reports on protests.” According to the article, for almost a week after the BLM demonstrations started, “national media made editorial choices, mirroring a framework social scientists have dubbed the ‘protest paradigm,’ that often failed to frame the events of the day accurately.”

It’s the idea that “the press contributes to the political status quo by reinforcing whatever the government thinks,” media researcher Danielle Kilgo told NiemanLab.

In the context of race and racism, data helps validate experiences and perspectives that are often undercut, misunderstood, silenced or ignored.

At first, McLeod also thought the paradigm might be in play. He observed the framing of the protests as contests between the police and demonstrators. There was more emphasis on the protesters’ acts of civil disobedience than on critical issues like racism. Much of the early coverage followed the same pattern of portraying the marchers as deviant, said the University of Wisconsin journalism professor.

Then, as momentum spread across the globe, he noticed changes in media content -- stories began appearing more sympathetic to the protesters and their causes. “Initially, we saw coverage that reflected the protest paradigm, and then a gradual shift away from that,” said McLeod.

He speculates that communication tools in the hands of the movement made it easier for the public and journalists to have access to the protesters’ viewpoints. The duration and size of the demonstrations made it harder to ignore the legitimacy of issues that drove the activism.

“When public opinion begins to shift, the media coverage will respond as well,” said McLeod. A June Pew Research Center report found that two-thirds of U.S. adults support the Black Lives Matter movement.

There might have been another factor. “In the context of race and racism, data helps validate experiences and perspectives that are often undercut, misunderstood, silenced or ignored,” said Kilgo, who studies media portrayal of social movements. “Narratives holding police behaviour accountable would bring a shift to the protest paradigm.”

Using forensic methods of investigation, mapping tools and visualisation, journalists have shed light on what may be the largest social movement in American history.

A Black Lives Matter Protest in Washington D.C. on 1 June 2020. Courtesy of Unsplash/ Instagram: @koshuphotography

Documenting police violence and racism

In today’s political environment, exposing racism and police brutality is often a prime focus for investigative journalists. Data is among the key building blocks.

Videos, maps and graphics have been used to verify police violence against protesters, expose patterns of racism in American communities, and reconstruct the scene where George Floyd cried “I can’t breathe” while a White police officer knelt on his neck. Some of the projects have become prototypes for multi-dimensional storytelling.

In July, the Washington Post published “Resources to understand America’s long history of injustice and inequality,” featuring in-depth stories, videos, photo essays, audio and graphics of Black history and the progress – or lack of it – in the fight for racial justice. It wasn’t the first time the Washington Post put racism under the microscope.

In 2015, the newspaper created a database of police killings, logging every fatal shooting by an on-duty police officer in the United States. Among the findings: “Although half of the people shot and killed by police are White, Black Americans are shot at a disproportionate rate. They account for less than 13 percent of the U.S. population, but are killed by police at more than twice the rate of White Americans.”

Hispanic Americans also are killed by police at a disproportionate rate, the study found. The Post’s methodology for “Fatal Force” can be read here.

In June, The New York Times zeroed in on police brutality in Minneapolis, examining data from 2015 through May 26, the day after George Floyd’s death. The pattern of abuse was striking: Minneapolis police used force against Black people at seven times the rate of Whites. A map broke down police use of force against the Black population by city block. A graphic illustrated how often neck restraints, chemical irritants and other types of force were used on Black people versus Whites.

About 20 percent of Minneapolis’s population of 430,000 is Black. “But when the police get physical — with kicks, neck holds, punches, shoves, takedowns, mace, tasers or other forms of muscle — nearly 60 percent of the time the person subject to that force is Black, according to the city’s own figures,” the Times reported.
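To make the arithmetic behind such claims concrete, here is a minimal sketch, in Python, of how a per-capita rate ratio is computed. The Black population and use-of-force shares are the rounded figures cited in the Times story; the White shares are purely illustrative assumptions, since the story does not state them directly.

```python
# Minimal sketch of the per-capita rate-ratio arithmetic behind claims like
# "police used force against Black people at seven times the rate of Whites".
black_pop_share = 0.20    # ~20% of Minneapolis residents are Black (from the story)
black_force_share = 0.60  # ~60% of use-of-force subjects are Black (from the story)
white_pop_share = 0.60    # ASSUMED for illustration; not given in the story
white_force_share = 0.25  # ASSUMED for illustration; not given in the story

# A group's per-capita rate is its share of force incidents divided by its
# share of the population.
black_rate = black_force_share / black_pop_share  # 3.0
white_rate = white_force_share / white_pop_share  # ~0.42

print(f"Rate ratio: {black_rate / white_rate:.1f}x")  # ~7.2x with these inputs
```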

For another story, The New York Times’ visual investigative team reconstructed the minutes leading to Floyd’s death, using security footage, witness videos, official documents, and interviews with experts, a combination of traditional reporting with advanced digital forensics. The video shows evidence of officers violating Minneapolis Police Department policies as they tormented Floyd.

There is evidence of police assaults on another front. Bellingcat, a non-profit investigative group, collaborated with The Guardian to visualise police violence against journalists at protests across the U.S. They found more than 148 arrests or attacks on reporters and photojournalists between May 26 and June 2. These were known incidents; the total could be higher.

Charlotte Godart, a Bellingcat investigator, described the data-gathering process for this investigation: “We compiled the data into a spreadsheet and plotted each of the incidents using geolocation. We compared the streets, buildings, and specific details in the videos and images until we were able to find, in most instances, the exact spots where these journalists were standing when they were detained, pepper-sprayed, or physically assaulted by the police.” An interactive map visualised where these incidents occurred in space and time. The Bellingcat website provides a “Beginner’s Guide to Geolocating” video.
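For readers curious about the mechanics, turning a spreadsheet of geolocated incidents into an interactive map takes only a few lines of Python. This is a minimal sketch in that spirit, not Bellingcat’s actual pipeline; the file name and column names are hypothetical.

```python
# Sketch: plot geolocated incidents from a spreadsheet on an interactive map.
# The CSV and its columns (date, city, lat, lon, description) are hypothetical.
import pandas as pd
import folium

incidents = pd.read_csv("press_freedom_incidents.csv")

# Start the map centred roughly on the continental United States.
m = folium.Map(location=[39.8, -98.6], zoom_start=4)

for _, row in incidents.iterrows():
    folium.Marker(
        location=[row["lat"], row["lon"]],
        popup=f'{row["date"]}, {row["city"]}: {row["description"]}',
    ).add_to(m)

m.save("incident_map.html")  # open in a browser to explore the incidents
```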

In search of reliable data

Over the years, Amnesty International has fought a continuing battle against the world’s worst human rights abuses. In May, the spotlight was on America. Between May 26 and June 5, 2020, researchers documented 125 incidents of police violence against protesters, journalists and bystanders in 40 states and the District of Columbia committed by state and local police, National Guard troops and security forces from federal agencies.

According to the report, “Police across the U.S. committed widespread and egregious human rights violations against people protesting the unlawful killings of Black people and calling for police reform.” Abuses included beatings, misuse of tear gas and pepper spray, and firing projectiles such as rubber bullets directly at protesters. A Washington Post story told of eight people being partially blinded by the police.

For journalists, the Amnesty report was a bonanza. The data provided reporters with fodder to explore police activities in their hometowns. It also linked to a larger issue: Brutality by law enforcement on a national scale that went beyond the urban centres.

Investigators found that local police improperly used tear gas against peaceful protesters in Murfreesboro, Tennessee; Sioux Falls, South Dakota; and Conway, Arkansas, among other places. In Iowa City, Iowa, police fired tear gas and threw flash-bang grenades at protesters kneeling and chanting “Hands up, don’t shoot,” according to Amnesty. In Fort Wayne, Indiana, a local journalist lost his eye when police shot him in the face with a tear gas grenade.

"The analysis is clear: when activists and supporters of the Black Lives Matter movement took to the streets in cities and towns across the country to peacefully demand an end to systemic racism and police violence, they were overwhelmingly met with a militarized response and more police violence," said Brian Castner, Amnesty's senior crisis adviser on arms and military operations in a statement.

Amnesty’s website posted a summary of how the data was collected: “The Crisis Evidence Lab gathered almost 500 videos and photographs of protests from social media platforms. The digital content was verified, geolocated and analysed by investigators with experience in weapons, police tactics, and international and U.S. law governing the use of force. In some cases, researchers were also able to interview victims and confirm police conduct with local police departments.”

Non-profits like Amnesty International provide reliable information and are go-to sources for reporters. Media Bias/Fact Check rates Amnesty high for factual reporting “due to proper sourcing and a reasonable fact check record.”

In some newsrooms, journalists themselves are gaining the skills to do open-source investigative reporting, using visual forensic methods to tell the story. In a June 16 article for the Global Investigative Journalism Network (GIJN), reporter Rowan Philp highlighted tools and techniques reporters around the globe are using.

“These tools are absolutely remarkable, especially when we add dogged curiosity. Then they become really powerful,” said Philp, a GIJN media reporter and former foreign correspondent who has covered corruption and conflict in more than two dozen countries.

His article points out that the visual forensics approach “is not about a smoking gun document or whistleblower’s testimony, but rather about assembling pieces of a visual and time puzzle, where, taken together, they have the weight of evidence to have an impact.”

This raw evidence can expose attacks by security forces beyond a reasonable doubt. Philp cited an example from BBC’s Africa Eye where reporters grabbed Facebook Livestream clips of a protest in Sudan in real-time, and later pieced together evidence of the massacre of at least 61 protesters by a hostile militia. The team reconstructed events from 300 mobile phone videos, mostly taken by protesters as they fled.

Philp has high praise for Bertram Hill, part of Africa Eye’s investigative team. Hill has developed an accessible list of more than 200 open source and forensic tools that are Africa-centric but can be applied globally. “These are incredible tools for journalists anywhere and most of them are free,” said Philp.

His GIJN story lists 12 techniques used in recent investigations tracing abuse by security forces. Among them:

  • “Remember the familiar tools you use every day. If you need to show that a protester or police officer could not have walked from A to B in a certain time, just enter those addresses in Google Maps, and it will automatically show the walking time.”

  • “Track individuals through video clips by using markers, like distinctive clothing. One man’s bald spot proved to be an important marker for the NYT investigation into the beating of protesters by a Turkish security detail.”

  • “Avoid manipulating footage, unless it is essential for audience understanding, and, if you do, explain why you’ve altered it. However, artificially highlighting an object, like a weapon, is appropriate, when presented within the context of the original photo.”

Philp’s story contains a dizzying array of information, from timeline-based video editing tools and advanced search functions on Google and Twitter to archiving tools and people-finding apps like Pipl and Spokeo. Hill’s list of 200 resources can be found here.

The use of data to expose police violence and racism is inspiring. It clearly makes a case for visual forensic methods and other cutting-edge technology to be part of the journalists’ toolkit. Philp would like to see reporters around the world develop these niche skills, as he calls them. The results, he said, can be “exhilarating.”

Other resources that can help:

The Committee to Protect Journalists (CPJ) has posted a safety advisory and monitors incidents of police violence against the press via the U.S. Press Freedom Tracker. According to CPJ, the police appear to be responsible for the majority of incidents, although crowds and protesters also have targeted the press.

Among CPJ’s tips for digital safety:

  • Be aware of the information stored on your devices. Think about the type of information police will have access to should they detain you and gain access to your phone or laptop.
  • If possible, leave your main phone behind and instead carry a phone that has minimal information on it. If you cannot leave your phone behind remove as much personal information as possible, including logging out of and deleting apps from the phone. For more information, see CPJ’s advice about device security.
  • Turn off location services for your apps, as this information is stored by companies and could be subpoenaed by the authorities at a later date.

The U.S. Crisis Monitor bills itself as “The only source of real-time data that captures both political violence as well as demonstrations in the United States.” This joint effort of the Armed Conflict Location & Event Data Project and the Bridging Divides Initiative at Princeton University features an interactive crisis mapping tool that is updated regularly with new data and trends.

National Press Photographers Association: Practical advice about covering high-profile news stories during protests and the upcoming election by Mickey Osterreicher, NPPA general counsel. It offers information on a variety of topics, including arrest and release, being questioned or detained, and complying with police orders.

First Draft News: “George Floyd Protests: Resources for online newsgathering, verification, protecting sources and colleagues.” It covers how to use social media to report on protests over police brutality and structural racism sweeping the world.

Coronavirus coverage: giving a voice to the vulnerable with data on your side https://datajournalism.com/read/longreads/coronavirus-coverage-giving-a-voice-to-the-marginalised-and-vulnerable-with-data-on-your-side Wed, 03 Jun 2020 23:07:00 +0200 Sherry Ricchiardi https://datajournalism.com/read/longreads/coronavirus-coverage-giving-a-voice-to-the-marginalised-and-vulnerable-with-data-on-your-side The team of ProPublica reporters faced a daunting task. Using a database obtained from the county medical examiner’s office, they began tracking the first 100 recorded coronavirus deaths in Chicago, America’s third-largest city. What they uncovered was stunning.

Seventy of the first 100 COVID-19 victims were black, reflecting a broad racial disparity in the early toll of the virus. African Americans make up only 30% of the city's population.

The reporters divided up the cases and set out to find relatives and friends to explore why this group was disproportionately affected. “The First 100,” as the project was called, had another mission: To recognise and honour the fallen.

When the story was published, those who died had names and personalities. The bereaved remembered them with tears in their eyes and love in their hearts. They became more than entries on a death list.

This collaborative effort was the perfect combination of data journalism and shoe-leather reporting, although due to the virus, interviews were conducted by telephone and email instead of knocking on doors.

“The best stories marry data and narrative writing,” said Duaa Eldeib, a member of ProPublica’s investigative team based in Illinois. “Our goal was to get as many of their stories as possible in hopes of understanding why the disease was ravaging their neighbourhoods."

Eldeib spearheaded the data search, filing an open records request with the Cook County Medical Examiner’s Office and obtaining data from the Chicago and state of Illinois public health departments. The story, published on 9 May 2020, pointed to a reality: “COVID-19 took black lives first. It didn’t have to.”

Far less attention has been paid to the impact on the world’s vulnerable populations.

An overview: Voice to the voiceless

As COVID-19 spread from China worldwide in January, data journalism played a critical role in providing vital, reliable information about the rapid onset and ferocity of the infections.

Interactive maps allowed the public to follow the virus through cities, villages and neighbourhoods. Graphics illustrated how the contagion invades the body, multiplies and ravages the organs. Jagged lines on fever charts marked the ebb and flow of cases and deaths across the globe. This information was public service journalism at its best.

But, was that the whole story? What did data show about how the coronavirus impacts marginalised communities? If there were disparities in cases and deaths among economic groups, who was most vulnerable?

Millions among the world’s “invisible” populations slip through cracks of the system, making them prime targets. Who was reaching out to them?

“Giving voice to the voiceless is more critical now than ever,” said two-time Pulitzer Prize-winner Martha Mendoza of the Associated Press. “Marginalised immigrants, the homeless, the incarcerated, the poor need to be reached as part of the coverage.”

Mendoza, a member of AP’s global investigative team, called this “a teachable moment for data journalism.” Yet, when it comes to coverage, there appears to be a dichotomy. Stories on how the virus impacts the big three -- politics, economy and healthcare -- have exploded on the Internet. Far less attention has been paid to the impact on the world’s vulnerable populations. According to news reports, that could be a serious oversight.

In May 2020, a New York Times headline warned, “As coronavirus deepens inequality, inequality worsens its spread.” The report noted “the pandemic is widening social and economic divisions that also make the virus deadlier, a self-reinforcing cycle that experts warn could have consequences for years to come.”

As the Times’ story indicates, the world’s disenfranchised populations have become a new frontline in the fight against COVID-19. What follows are examples of how media used data journalism to humanise the pandemic’s effect on vulnerable and under-served populations.

We have to remind ourselves, there is a human being behind every single number.

The First 100: Breathing life into numbers

After obtaining the names of Chicago’s first 100 COVID-related fatalities from county officials, ProPublica’s reporters turned to Nexis searches, social media, obituaries, funeral homes, family and friends to build their database.

Operating out of five cities across the country, they held meetings via Zoom and used Google Docs to coordinate reporting. The story led with three victims whose cases reflected the investigation’s major findings, including:

  • An analysis of medical examiner data showed that most of the first 100 recorded victims were black and lived in segregated neighbourhoods where the median income for 40% or more of the residents is less than $25,000 a year
  • Many were already sick with multiple health conditions
  • There was a lack of well-resourced hospitals and healthcare in some neighbourhoods. Poverty, unclear guidance about when to seek treatment, and lack of adequate access to medical care were among factors that contributed to the higher death rates

Phase two of the investigation was the search for those who knew the deceased. For some relatives, the loss was too recent, too difficult to talk about. Others were eager to tell stories about their loved one.

“Some of these families were in the midst of trying to make funeral arrangements, trying to figure out how to mourn their loved ones while observing social distancing requirements, but they still talked to us,” reporter Duaa Eldeib wrote in ProPublica. In the end, families and friends of 22 of the victims shared memories.

Those intimate interviews, said Eldeib, were critical to the story. “With COVID-19, we hear so much about numbers and statistics and comorbidities. We have to remind ourselves, there is a human being behind every single number. For us, it was important to make sure we were incorporating that humanity into our reporting,” she said.

According to the Centers for Disease Control and Prevention, nearly 23% of reported COVID-19 deaths in the U.S. were African American as of 20 May 2020, even though black people make up roughly 13% of the U.S. population.

We try to get involved in the story process early rather than delivering individual graphics at the end, or dropping a few lines with lots of numbers into someone else’s piece.

Interactivity sheds light on coronavirus

The nonstop flood of information on COVID-19 has been dizzying. How does the public make sense of it? Niko Kommenda, visual projects editor for The Guardian, offers a solution: interactive journalism.

“Good data journalism is key to understanding how the virus and lockdown measures have affected our lives more widely, what new inequities they have revealed and what lessons we can learn for the future,” said Kommenda. “These are some of the most important stories to come out of this crisis in my opinion."

He cited interactive tools that allow readers to find their own areas or demographic groups in large datasets, localise the impact of the disease, and gain perspective and context. This kind of technology makes information more relevant, Kommenda said.

For instance, the data project team found that Londoners living in the most poverty-stricken areas have less access to private green spaces and would be hardest hit by public park closures. The headline for the April 2020 story read: “Coronavirus park closures hit BAME and poor Londoners the most.” BAME stands for “Black, Asian and Minority Ethnic.”

Collaboration between the data projects and visuals teams revealed that ethnic minorities in the UK have a much higher risk of dying from COVID-19, raising further questions about disparities in access to healthcare and safe working conditions.

The Guardian published its first visual explainer on the virus in early February while the majority of cases and deaths still were in China. As their visual tracker evolved, “We were able to establish comparisons between countries, put the data into historical context and shed light on different scenarios playing out,” wrote Kommenda in a Guardian story about the human toll of COVID-19.

Two teams, visuals and data projects, often must join forces to work on the most labour-intensive stories. “Whenever a project relies on a constantly updating data feed and/or makes use of interactive graphics -- as is the case with our COVID-19 trackers -- someone from visuals will be involved,” said Kommenda.

Both teams cooperate with other desks to develop stories -- in the case of coronavirus, that could be the home news, foreign news, business, health or environment desks. “We try to get involved in the story process early rather than delivering individual graphics at the end, or dropping a few lines with lots of numbers into someone else’s piece,” the editor said.

Kommenda’s advice to data and visual journalists covering the virus: “Identify stories where you can give added context and amplify otherwise unheard voices. That’s why we at The Guardian focus on covering, among other things, the social inequality aspect and the environmental implications of this crisis.”

The goal is to cull numbers out of stories and into interactive graphics, and to use compelling photographs to put a human face on the pandemic.

What happens inside these facilities is not just happening to criminals. Prisons, like nursing homes, have been incubators for the spread.

Tracking an invisible population

The story was alarming: New Jersey prisoners were dying from the coronavirus at a higher rate than those in any other prison system in America, according to a new study.

After reading the names of the dead, an inmate at a New Jersey state prison told a reporter, “Nobody talks about these men. These men were sons, they were fathers, they were brothers. We’re waiting to see who’s gonna die next.” The story was published by nj.com, a digital news content provider and website.

The study on which it was based came from The Marshall Project, a non-profit news organisation that analyses inequities, discrimination and abuses in the justice system. As the virus swept through America, Marshall Project reporters began creating a state-by-state database on coronavirus in prisons.

Their database has been used by major media, including NBC Nightly News, Detroit Free Press, Baltimore Sun and Associated Press, according to the project’s managing editor for digital and data, Tom Meagher. In addition to inmates, the study found 7,000 prison employees had been infected.

“What happens inside these facilities is not just happening to criminals. Prisons, like nursing homes, have been incubators for the spread,” said Meagher, a veteran reporter and editor. “It’s not just about the safety of the prisoners, but also the safety of employees, their families and the communities around them. The staff brings the virus into the prison and they take it out again.”

The project, named after former Supreme Court justice and civil rights activist Thurgood Marshall, has produced around 70 stories about the coronavirus and prisons, available on the website. It has teamed up with the Associated Press to co-publish data and collaborate on stories.

The state-by-state study originated to fill a gap. “We knew the coronavirus was going to be a massive story and we felt people weren’t going to pay attention to prisons or provide as much scrutiny as we thought was needed. The only way was to start collecting data because no one else was going to do it,” said Meagher.

Be careful. Don’t draw conclusions from anything but actual numbers, and always think about who is the most vulnerable along every step of the reporting.

COVID is everyone’s beat

As part of this article, media experts were asked for advice on covering what NBC Nightly News anchor Lester Holt called, “The biggest story we have ever seen. This affects the entire world. Each and every one of us.”

As the coronavirus swept the world, newsrooms moved into uncharted territory, pursuing a story moving at warp speed. Everyone, from sports editors to food and fashion writers, became part of the COVID beat. The challenges are great; so is the opportunity for journalism to shine.

Steve Doig, data specialist and professor at the Walter Cronkite School of Journalism and Mass Communication at Arizona State University, called the COVID era, “A golden moment for data journalism.” It is at the heart of investigative reporting right now, said Doig who conducts media training and workshops on the topic. He offered the following advice:

  • Take your expertise and look for COVID impact. What kind of stories can be spun off your regular beat? If you cover education, what happens when schools close? Who wins, who loses, who falls through the cracks?

  • Check effects on the voiceless and marginalised communities. Who in the newsroom covers the social services beat? What should they be looking at?

  • Familiarise yourself with useful data sources. Johns Hopkins Coronavirus Resource Center, data banks from the New York Times and Washington Post and others help add context to stories. Do a review to see what they have to offer.

  • Look for someone in the newsroom who does census journalism. In which part of town are the most homes owned versus rented? Who is getting evicted? How are they dealing with it? A source that could be helpful: https://censusreporter.org/

“This is new, journalists only have been focused on [the coronavirus] for a few months,” said Doig. “No matter what you covered before, now we are all medical writers; everybody is scrambling to get up to speed on the virus. There still are many unknowns.” In some instances, data journalism has gained new prominence during the COVID era.

Cairo-based Amr Eleraqi, a data journalism pioneer in the Middle East and North Africa region, has seen a turning point. “There was this argument that the public wasn’t interested in data, that they found it boring. COVID taught people in this region to love data,” said Eleraqi, founder of Infotimes.org, the first Arabic website specialising in data journalism.

“We see an increase every day in readers looking for more analysis. They want comparisons about cases and deaths in their own countries and across borders.”

In 2017, Eleraqi was the driving force behind the launch of Arab Data Journalists’ Network, featuring training materials, resources, tools and techniques for data-generated storytelling. Fact-checking also is on the agenda as journalists wade through misinformation, conspiracy theories and myths about the virus that pop up on the Internet.

“Accuracy in journalism is more important now than ever. If [journalists] are using data and getting it wrong, then all of us lose our credibility,” said AP’s Mendoza.

Her advice: “Be careful. Don’t draw conclusions from anything but actual numbers, and always think about who is the most vulnerable along every step of the reporting.” She lists the Centers for Disease Control and Prevention, the Federal Procurement Data System and the Johns Hopkins Coronavirus Resource Center as trusted sources for data.

Another website called NewsGuard tracks news and information sites in the U.S., U.K., France, Italy and Germany and provides a misinformation hotline.

In May 2020, Mendoza co-authored an article for AP on counterfeit masks from China reaching frontline healthcare workers in the U.S., where medical masks were in short supply. Among the tools she uses for this type of reporting:

  • ImportGenius, an international trade database
  • USASpending, the official source of accessible, searchable and reliable spending data for the U.S. government
  • trac.syr.edu, information about federal enforcement staff and spending
  • Marine Traffic, displays near real-time positions of ships and yachts worldwide
  • PACER, Public Access to Court Electronic Records

“Collaborating with colleagues during this intensive time can elevate your work and bring humour and warmth into your daily interactions,” said Mendoza. Collaborative journalism also can provide a wider audience, help to cut costs and encourage quality journalism. One example, operating out of Lima, Peru, brought journalists from eight Latin American countries together to work on health-related stories. Their attention now has turned to the coronavirus.

I always tell reporters we also have to offer hope with our stories. We must show the problems of the virus, but also that it’s not the end of the world.

As part of her International Center for Journalists Knight Fellowship, Fabiola Torres created Salud Con Lupa (Health with a Magnifying Glass), a digital platform for collaborative journalism that has become a hub for data-driven coverage of the pandemic and dispelling misinformation about the virus.

Torres also is one of the founders of OjoPúblico, a nonprofit newsroom in Lima, which spearheaded The Big Pharma Project, a series of multinational investigations that shed light on methods used by pharmaceutical companies to consolidate their monopolies in Latin America. She advises reporters to start with a series of questions when planning a coronavirus assignment. Among the most common:

  • What is the most important thing I need to explain to my audience right now about the pandemic?
  • What is at stake with this disease?
  • Who is benefitting from this global crisis?
  • What are the social and economic side effects?
  • Is the virus creating even more poverty and inequality than existed before?

Among her favourite tools:

  • OpenRefine to clean and organise data
  • Datawrapper, to create visualisations
  • Request a woman scientist, to verify information on medical topics and search by country, discipline, interest and degree

“I always tell reporters we also have to offer hope with our stories. We must show the problems of the virus, but also that it’s not the end of the world. The public gets anxious and begins to despair when they only see bad news every day,” said Torres. “We have to find a balance, look for solutions and show the courage of ordinary people who are fighting this silent enemy.”

In an April 2020 column for the New York Times, three epidemiologists noted that a billion people live in the world’s slums. The most important factor in enabling the spread of the pandemic, the doctors said, was the neglect of marginalised populations by governing elites. If journalists don’t tell their story, who will?

History will judge how well journalism fulfilled its public service mission covering the “biggest story ever,” as a headline in The Guardian described the coronavirus. Here are some resources that can help.

Additional resources

Reporters Without Borders, “#Tracker_19: Covid-19 impacts on press freedom”

Committee to Protect Journalists, “Safety advisory: Covering the coronavirus epidemic”

Reuters, “Breaking the wave, measuring the death toll of COVID-19 and how far countries are from stopping it”

European Centre for Disease Prevention and Control, an agency of the European Union

World Health Organisation, “Coronavirus disease pandemic”

Centers for Disease Control and Prevention

Johns Hopkins Coronavirus Resource Center

First Draft, “Coronavirus: Tools and Guides for Journalists”

Investigative Reporters and Editors, “Top tips from health reporters and officials on covering COVID-19”

Solutions Journalism Network, “Covid-19 containment”

Poynter Institute for Media Studies, “A daily coronavirus briefing for journalists”

Article 19, “Viral Lies: Misinformation and the Coronavirus”

Ethical Journalism Network, “Media ethics, safety and mental health: reporting in the time of COVID-19”

European Journalism Observatory, “How media worldwide are covering the coronavirus crisis”

Knight Science Journalism, Massachusetts Institute of Technology, “Tips and Tools for Reporting on Covid-19”

International Journalists’ Network, "COVID-19 reporting tips and guidelines”

Mastering data for better business journalism https://datajournalism.com/read/longreads/mastering-data-for-better-business-journalism Thu, 21 May 2020 13:47:00 +0200 Erik Sherman https://datajournalism.com/read/longreads/mastering-data-for-better-business-journalism If money makes the world go round, business journalists communicate and explain the dizzying spins that affect everyone.

Their reporting underpins almost every part of society. There's no shortage of stories about how multinationals make their billions, not-for-profits fund activities, or people invest their money. What journalists add is the critical bridge between complex issues and how people understand the impacts on their lives.

Business reporters who cast these stories in understandable and accessible ways help their audiences make better decisions. Theirs is one of the most demanding and dynamic fields in journalism, and it can't be done without an excellent grasp of data.

As a business journalist, you can find yourself doing anything from a quick turnaround on earnings reports to filing briefs on corporate comings and goings for a trade publication, or interviewing a CEO for a profile piece.

Maybe you follow a few specific companies with a contact list of insiders who can give you scoops. Or perhaps you are chasing a long-form piece with character arcs and narrative plotting and pacing.

No matter how you cover business -- text, short social media posts, video, podcasts -- data journalism should be in your tool kit.

You've almost certainly been working with data already: industry or government statistics, U.S. Securities and Exchange Commission filings, projections from market analysts, and more. They probably don't seem like "data journalism" because the volume of information is low, or the analysis seems nowhere near as complex as what you imagine data journalists do.

But it's all a matter of degree. Business journalists typically look for some numbers for their stories because they want to compare and contrast things. How company A differs from B. The valuation of a startup and what it would need to achieve to make that number seem reasonable. Consumer trends and financial pressures. All numbers.

Reporters and editors do this without a second thought because it's just part of the work. Large scale or small, though, it's all data journalism, translating information into words or images. Learning more about it, and even collaborating with colleagues who focus on the data aspect, can enrich your work. The more you understand, the better your craft will be. It can also be easier than it seems.

Beyond narrative alone

Data journalism relies heavily on maths—essentially specialised languages for conveying certain types of relationships and truths—and technology that allows the storage, manipulation, and analysis of all sorts of information. These fields, different from spoken and written language, can help add power to reporting.

Journalism has an affinity for such traditional story narrative elements as characters, plots, development, and emotional hooks. While fine, the approach has limitations. Reporters and editors might pass over stories that lack the inherent "dramatic" elements but nonetheless are important for an audience.

Data and, by extension, technology and mathematics may seem, through exposure at school, cryptic, cold, and cruel businesses. But, as types of languages and skills, they aren't any more so than the study of music or German or automobile repair.

Data and analysis can mix with the narrative impulse of many journalists. Not as a replacement, as most people won’t take up a quasi-mathematical treatise along with descriptions of the data structures for their entertainment reading. Instead, aim to bolster your current information sources and even story concepts.

Never be satisfied with a summary of data. Get the whole study.

Framing coverage

Before considering how data journalism can help, it's best to start with some analytic thought about your coverage. Glueing data analysis and visualisations onto stories doesn't make much sense if they aren't compatible or needed.

Just as you learn to ask who, what, when, where, how, and why in basic reporting, start with some questions. Here are a few examples, although don't treat them as limits:

  1. What is the nature of my beat and where does it intersect with information?
  2. How do people in the industry measure their businesses' performance?
  3. Are the measures they use reasonable?
  4. Is there information that might illuminate aspects of what I'm trying to describe?
  5. Can data support or refute claims that people I've interviewed are making?
  6. What information puts a company or person into a larger context?
  7. Can I generalise from the specific, finding larger frameworks of data that extrapolate from an example to a matching trend?

I have spent more than 25 years in the field as a business reporter, covering everything from startup issues to multinational controversies. Here are some examples where I've asked such questions and found answers and applications in my own work:

Are the measures they use reasonable?

I once got into a discussion that turned heated with a CEO I was interviewing for a company profile.

The business repeatedly used pro forma financials—presentations of results that eliminate one-time gains or losses to show how the ongoing business is doing. Pro forma statements can provide insight or cover shortcomings, depending on how they're used and understood. But this company used them every quarter, which meant that what should have been exceptions were really usual and expected conditions.

I insisted on discussing standard accounting treatments (called GAAP, or generally accepted accounting principles, in the U.S. and IFRS, for international financial reporting standards, in most of the rest of the world). Such rules allow comparison of companies on an even basis and prevent executives from using accounting as a way to hide the truth of corporate performance.

The approach told me something important about the company's performance and what it was trying to achieve, which was to create a pretty picture that didn't really exist. It also enraged the CEO, who saw a carefully cultivated picture begin to crack.

Is there information that shows aspects of what I'm trying to describe?

A piece I wrote for Fortune explained why life could be so expensive even though overall inflation is at historical lows. To tell the story, I pulled together sources of information about the growth of U.S. inflation (the consumer price index), per-capita disposable income (money left after paying taxes), and costs of homeownership, rent, health, and school and childcare.

I assembled the columns of numbers in a spreadsheet and then indexed each. That is, for every category, I divided the values of all months by the value of the first. By making every month a multiple of the first, I could show growth over time as a series of percentages of that first value.
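For anyone who prefers code to spreadsheet formulas, the same indexing step is a one-liner in pandas. This is only a sketch with invented numbers and column names, not the actual Fortune dataset:

```python
import pandas as pd

# Hypothetical monthly series, already aligned by month (invented values).
df = pd.DataFrame({
    "cpi":               [236.7, 237.0, 237.3, 238.0],
    "disposable_income": [39500, 39520, 39560, 39610],
    "rent":              [1340, 1352, 1360, 1371],
})

# Index every column to its first value: each month becomes a multiple
# of month one, so growth across categories is directly comparable.
indexed = df / df.iloc[0]

# Expressed as percentages of the starting value, ready to graph.
print((indexed * 100).round(1))
```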

Then I put everything into a graph format to make the comparisons easy to see. Disposable income always lagged far behind everything else.

Can data support or refute claims that people I've interviewed are making?

A PR person described a client of his as a Fortune 500 company. The business was in a segment of technology I had covered in some depth and yet I had never heard of the name. A quick browser excursion to Fortune's site let me search through the current Fortune 500 list of the largest public corporations. Surprise, surprise, there was no listing for the company.

When I challenged the PR person, he said it was a "Fortune 500 type company." I sent the email to the trash. But imagine the devilry that would have arisen had someone quoted the assertion without checking data to verify. This is an example of how data journalism can do critical background work invisible to the audience.

Forget performing mathematical calculations. Start with a consideration of what a given set of statistics claims to show.

What information puts a company or person into a larger context?

Taking a backward look for Forbes.com at the effects of the global financial crisis on wages, I wanted to move beyond mean and median representations of data, which often obscure a fuller reality.

Means and medians do offer one way to summarise a body of data, but they do so by flattening out how things vary in different circumstances. If you have €20 and a drinking companion has none, there's little doubt of who will have to pick up the bar tab, even though the average amount each of you has is €10.
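The bar-tab point takes only a few lines to demonstrate. A minimal illustration using just Python's standard library:

```python
from statistics import mean, median

# Two drinkers: one with 20 euros, one with nothing.
wallets = [20, 0]

print(mean(wallets))               # 10 -- the "average" drinker has 10 euros
print(median(wallets))             # 10 -- no better with only two values
print(min(wallets), max(wallets))  # 0 20 -- the distribution tells the story
```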

A web search revealed that the Federal Reserve Bank of Atlanta had broken out wage growth by wage size, skill level, and full- or part-time status. Downloading the data, I was able to create a set of graphs, like the one below. The graphs painted a picture of how income inequality increased and, even as overall wages began to recover, some categories of workers had lost more than they would regain.

Role of data in business journalism

All journalism serves to answer questions that the public might have. Some outlets—Nate Silver's FiveThirtyEight, The Upshot at the New York Times, the Guardian's data journalism blog—regularly use data as the main tool in business journalism. An intriguing new not-for-profit venture, The Markup, develops and builds its own datasets to report on large high-tech companies.

Data sets can become characters in their own right. A story might address a curious pattern someone noticed in information, accompanied by explorations of how the results came to be and why they are relevant to readers.

Or the story could be about the existence and use of the data itself, as in the piece that Olivia Solon and Cyrus Farivar did for NBC News about how Facebook has used its data "to fight rivals and help friends."

Data can also support more traditional coverage, whether inverted pyramid hard news or a feature. When LinkedIn filed to go public, I pulled together a story for CBS Interactive, using data from web searches and financial filings, to compare the company's long-standing claims of profitability with the reality of being in the red for much of that time.

An important rule of thumb is to let data highlight and amplify the answer to an inquiry, or even the existence of the question itself. Avoid using data for its own sake or you risk losing your readers. Before anything else, though, there is preparatory work.

Everyone in business or investing wants to know the future. Estimates and projections try to scratch that itch.

Statistics, studies, and polls

Forget performing mathematical calculations. Start with a consideration of what a given set of statistics claims to show. Who released the numbers? How were they gathered? Are you looking at a collection of data over time, like from a government agency? Or is this a study or poll that requires additional information to understand its validity and limitations?

Never be satisfied with a summary of data. Get the whole study. Years ago, I wrote an entire piece about how a "statistic" claiming that 14% of all laptops are stolen was utter hogwash. A combination of interviews, data, and some easy analysis set up the entire story:

· The company offering the number to every reporter available had a business in computer insurance.

· Because the company only looked at its own customer base, the subjects were people already more likely to lose or damage a machine. Otherwise, why buy insurance?

· The insurer did not release the details of how it arrived at its numbers, offering only a figure for how many claims it had. Reporters used estimates from market analysts to calculate a percentage of loss and did not consider how the company framed things to look particularly favourable for its interests.

· Law enforcement agencies and the broader insurance industry did not track laptop loss or theft at all, so there was nothing comparable.

· If you calculated the cost of laptops, the loss rate, and the price of the policies, you would see that the company should have been out of business almost immediately.
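That back-of-the-envelope check is worth making explicit. In the sketch below every number is invented; the point is the structure of the calculation, not the figures:

```python
# All values hypothetical -- the test is whether premiums could
# plausibly cover payouts at the claimed loss rate.
laptop_price = 1500.0     # average replacement cost per machine
claimed_loss_rate = 0.14  # the "14% of laptops are stolen" figure
annual_premium = 90.0     # price of a one-year policy

expected_payout = laptop_price * claimed_loss_rate

print(f"Expected payout per insured laptop: {expected_payout:.2f}")  # 210.00
print(f"Premium collected per laptop:       {annual_premium:.2f}")   # 90.00

# 210 > 90: at that loss rate the insurer loses money on every policy,
# so either the premium or the statistic has to be wrong.
```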

Never underestimate how much a company is willing to misuse you. After this article appeared, I noticed that virtually all use of the "statistic" suddenly disappeared from business and tech coverage.

There are many resources to learn more about checking whether what you are given seems reasonable. I wrote a piece for the Reynolds National Center for Business Journalism about how to assess a survey. Other resources include the American Association for Public Opinion Research, the Pew Research Center, the International Center for Journalists, and the Poynter Institute. Meanwhile, NICAR-Learn also recently made its data journalism videos free for one year.

Historical trends can lead to wrong conclusions.

Checking metrics

Metrics are measurements of ongoing activity, like production figures of a company's blue spanner line and a country's unemployment rate when half of the blue spanner makers are laid off.

They may be the product of regular data collection, or something pulled together for an article when journalists round up information and then categorise and count it. Public companies throw off never-ending compilations of metrics as required by government agencies.

As with statistics, do not automatically take them at face value. Information handed to you might be correct, or arranged to create an effect, as with the pro forma financials mentioned earlier. Also, historical trends can lead to wrong conclusions. Pointing to sales history as a guide to how a new product will do helps not at all if the product and sales territory are outside of what the company has previously done.

Estimates and projections

Everyone in business or investing wants to know the future. Estimates and projections try to scratch that itch. There was a stretch of time during the big slide this winter when I was monitoring the major U.S. stock indices for Fortune. I kept calculations in a set of spreadsheets, regularly updated to incorporate new information, until finally conditions were right for an article on how the Dow lost all the gains it had made since Donald Trump's inauguration on January 20, 2017.

Warning: not all editors are comfortable with such work. I recently had an editor insist that I take out similar types of modelling in a piece I was working on and only use a number provided by another source. Ironically, the final choice of citation was an article in a publication where a data journalist had independently used the same approach I was using.

More often, third parties offer their projections: how quickly consumers will adopt 5G telephone service, for example, or where the stock market might be in six months. Projections are almost always wrong. If they were regularly correct, the people who make them would find more effective ways to profit from their prognostic talents. That is not to suggest you should always avoid estimates. But consider the background of whoever creates them, their potential motivations, and the track record of their previous estimates. Always take such guesses with a grain of salt and remember that data is often uncertain. Recognise its limits while employing it.

Data sources

You must obtain data before incorporating it. This is both easy and challenging. The easy part is gaining stacks of information from many governments. Data on economics and commerce is omnipresent through official agencies. Many countries have clearinghouses that generate and disseminate data, and additional government agencies often have their own.

Take the U.S. as an example. There are tremendous data resources at all the cabinet-level agencies as well as regulatory bodies and the Federal Reserve and its regional banks. Similarly, you may find data, though not as much, at state and local levels. You will also find data available through universities, thinktanks, corporations, political groups, industry groups, lawsuit court filings (an underappreciated resource), international institutions, polling firms, and analysts, to name a few potential sources.

However, before you embrace that cacophony of information, take a moment to consider a point made by Liliana Bounegru in the Harvard Business Review. Reliance on existing data sets can "exacerbate the tendency to amplify issues already considered a priority, and to downplay those that have been relegated or which aren't on the radar screens of major institutions."

That falls short in two major ways. One, when everyone uses the same data, it can become challenging to find a story not available to everyone else. Two, neglected people and issues get passed over. One way to get past this is to start pulling together data from different sources to create a fuller view. A personal example came after the Wall Street Journal ran its articles on public companies that got federal COVID-19 financial relief intended for small businesses.

The Journal sorted through U.S. Securities and Exchange Commission financial filings to identify public companies that had mentioned the U.S. Paycheck Protection Program, the small business loan program that was part of the government’s response to the pandemic-fueled economic crisis, then put together its list.

It can be a big task, but one made easier if you use the full-text search at the SEC’s EDGAR site or a third-party SEC data search engine like SEC Info that allows broad searching across filings. I used the latter last October when writing for Fortune about mortgage-backed bonds that included buildings with WeWork as a major tenant. A search across the body of filings for the term WeWork turned up each prospectus that had to list the largest tenants in the buildings whose mortgages were included in the bond.
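If you are comfortable with a little code, that kind of broad search can also be scripted. The sketch below assumes the JSON endpoint that sits behind EDGAR's full-text search page (efts.sec.gov) and the shape of its response; both are undocumented and may change, so verify against sec.gov before relying on it. The SEC also expects a descriptive User-Agent:

```python
import requests

# Full-text search across EDGAR filings for a term such as "WeWork".
# Endpoint and response fields are assumptions based on the public
# search page, not a documented API.
url = "https://efts.sec.gov/LATEST/search-index"
params = {"q": '"WeWork"'}
headers = {"User-Agent": "Example Newsroom research@example.com"}

resp = requests.get(url, params=params, headers=headers, timeout=30)
resp.raise_for_status()

for hit in resp.json().get("hits", {}).get("hits", []):
    source = hit.get("_source", {})
    print(source.get("display_names"), source.get("file_date"))
```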

Business journalists typically look for some numbers for their stories because they want to compare and contrast things.

Getting the tools you need

You do not have to be a maths whiz to do much of this, but improving your understanding of what you are looking at helps greatly.

Much of the analysis I have mentioned required a combination of some technical skills, patience, and fundamental maths: addition, subtraction, multiplication, division, and working with and calculating percentages and fractions. Additionally, a grasp of basic probability and statistics helps identify weaknesses in data analysis and source material.

There are many free and low-cost online courses where you can brush up skills and gain new ones. For example, the Knight Center for Journalism at the University of Texas at Austin periodically has offerings in data journalism and data visualisation. The Poynter Institute has self-directed courses. The Reynolds National Center for Business Journalism at Arizona State University has video workshops. And, of course, there is useful material at DataJournalism.com.

If you have the chance to work with a data journalist on a project, that person can also potentially provide help and bolster those areas where you might be weak. You will also want tools to make work easier. Some of the basics for analysis are calculators and spreadsheets. Databases can help, but are more complicated; if you do not have experience, find someone who does.

Sometimes more specific statistical analysis is helpful. Microsoft Excel has many applicable functions, but you need to understand what they do and how they work. For advanced statistical tools, take a class or find someone who already knows how to use them.

Chances are you'll also want data visualisation tools to help build images that often portray data better than words alone. Do not expect to become an expert in any of this overnight. But, on the brighter side, toss the notion of having to achieve some acknowledged level of ability before starting to incorporate data.

Instead, start from where you are. Look for ways to incorporate data into developing stories or see what ideas data itself might generate. Over time, your data work will become better, enriching your business reporting.

]]>
Reporting beyond the case numbers: How to brainstorm COVID-19 data story ideas https://datajournalism.com/read/longreads/brainstorm-covid-19-data-story-ideas Thu, 23 Apr 2020 08:00:00 +0200 Paul Bradshaw https://datajournalism.com/read/longreads/brainstorm-covid-19-data-story-ideas While many journalists around the world report the daily global death toll and infection rates of COVID-19, audiences are seeking other stories that have a more personal and local impact on their lives. How can journalists use data to tell wider stories about the coronavirus’ impact? From the economy and relationships to mental health, press freedom and privacy, it’s hard to imagine a part of society that hasn’t been hit by the crisis. In this piece, journalists will learn how to use empathy to create story ideas with data.

Numbers are dominating our news updates right now — not just in the daily death tolls and counts of coronavirus cases, but also in stories that attempt to establish the scale of the crisis’s impact on the world, from drops in transport and air pollution and figures establishing how many children are having to celebrate birthdays behind closed doors, to histograms that almost shoot off the printed page.

And it’s clear that there’s a significant demand for numbers-driven analysis. Coming up with story ideas in this scenario requires resourcefulness and creativity rarely asked of journalists — but there are a number of techniques that can help.

The newness of a story lies not just in its data, but in the questions that the data leads you to ask.

Stories in the short term

In most countries, and in most fields, there is a delay between the collection of data and its publication. Data journalists often report on new data which relate to events a few months ago, or long-term trends — but during the coronavirus crisis these practices don’t always make sense. And while daily updates from health bodies and statistical authorities are eagerly awaited — they are also widely reported and picked over.

So how can you add something new to your reporting? Contextual data can be especially useful at a time like this, and boiling down an issue to its key parts can help you focus on what context is newsworthy.

The coronavirus crisis, for example, is not a deadly disease in isolation: it is a deadlier disease when our capacity to treat it is impaired. And the strategy of most governments has not been about stopping the spread of the disease, but rather slowing it to ensure that hospitals are not stretched beyond their capacity (which would result in deaths that could have been prevented).

What data is important in this context? The number of ventilators is one data point that has been particularly scrutinised; the amount of PPE is another. The number of hospitals, and their locations; the number of health staff and the number of beds — these are all parts of that capacity, and all aspects of the story that can give it something new.

Data on those allows us to establish a baseline for a country or region’s ability to handle the demands being placed on it, before adding further context on attempts to build more hospitals, add beds, bring staff out of retirement, and so on.

As we learn more about the virus, new avenues of context will open up. If ethnicity is a factor, how can we put that into context? If cramped housing is a factor, what picture can we paint of the situation regarding the state of housing? Context can be added through both time and space too: has the number of hospital beds been going up or down? Does your region or country have more or less than other places? You can combine both by looking at how the rate of change (time) compares with other areas (space).
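Combining time and space is straightforward once the figures are in a table. A minimal sketch with invented bed counts for two regions:

```python
import pandas as pd

# Invented hospital-bed counts per region over three years.
beds = pd.DataFrame(
    {"region_a": [1200, 1150, 1100], "region_b": [800, 820, 845]},
    index=[2017, 2018, 2019],
)

# Time: is each region's capacity rising or falling, year on year?
change = beds.pct_change() * 100
print(change.round(1))

# Space: how does the latest rate of change compare across regions?
print(change.iloc[-1].sort_values())
```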

The newness of a story lies not just in its data, but in the questions that the data leads you to ask.

If you are struggling to get data remember that a lack of data — or flawed data — is often newsworthy in its own right

If your data tells you that there aren’t as many beds in your country as there are in others, you might ask experts, charities and politicians why this is so, or what should be done about it, what is being done about it, or some other line which leads to a new story.

If you are struggling to get data, remember that a lack of data — or flawed data — is often newsworthy in its own right. A lack of data on ethnicity can lead to a story on warnings from politicians and pressure groups; a lack of data on deaths outside of hospital can lead to questions about why it’s not being counted, queries to care homes for indicative figures, and explainers on how many are really dying. Scrutinise the data being presented by politicians and health heads carefully: in the UK it has emerged that the government’s daily coronavirus briefings “repeatedly and incorrectly indicated that the UK has fewer coronavirus deaths than France, based on the numbers of deaths in hospitals”.

The Health Service Journal added an orange cross to this slide from the UK government to indicate the hospital covid-19 deaths for France that should have been used as a comparison with the UK’s hospital deaths.

Moving beyond health

The work to prevent deaths from coronavirus is a national effort, and you can identify three broad categories that every citizen has been placed into:

  1. Those infected with coronavirus
  2. Those being urged to stay at home to slow the spread
  3. Those with ‘essential’, ‘critical’ or ‘key worker’ roles allowed to travel and work

Around each, you can map a system of connections which is affected in different ways — and which affects those groups in turn. This process — which I’ve written about in detail here — involves starting with one person affected, and then moving to the people, organisations, concepts, documents and data that they lead you to.

For example, the story about who gets the disease, what symptoms they display, what treatment they require, and whether they survive, is a story that has evolved through a number of stages — age, gender, ethnicity — and from basic demographics to a reflection on divisions in our society. And it’s a story rooted in data.

A similar story can be reported about those affected indirectly: the people left without a partner, parent, child, or sibling, or the children who need to be looked after — the patterns that can be seen in their experiences might highlight systemic failures or successes; the variation will highlight the inequities.

Essential workers present a second series of interconnected systems to consider: health workers and care workers may be the front line, but they require managers to coordinate resources; supply chains to provide those resources; cleaners and other workers to maintain facilities; teachers and childcare to look after their children; transport to get them to and from the hospital; supermarkets to keep them fed; police to maintain order; and, yes, journalists to keep them informed, shine a spotlight on their experiences and concerns, and scrutinise those in power for the decisions that can be the difference between a hospital operating within capacity — or being overwhelmed.

Even at this level, we are touching on education, social care, transport, the media, politics, and the food supply — any of which can be mapped and explored for data sources.

It is worth looking at official definitions of these workers to find other roles you may not have considered. Burial staff and financial services, for example, are included in the UK list, while the US list mentions dams and nuclear reactors and waste.

The final level is everyone else, and here is where the knock-on effects multiply. We can look at:

  • Sectors that have had to close entirely (restaurants, tourism, beauty and sport, for example)
  • Sectors that have had to adapt, largely to online delivery (education, religious organisations and some events businesses, for example)
  • Sectors that have had to reduce operations (taxis, dentists, fuel and energy providers, office supplies, automotive services, or legal services for example)
  • Sectors that have had to increase operations (home delivery-based retail, YouTube fitness and yoga videos, or plain opportunists, for example)

Some sectors will fall into more than one category, of course — and finding data will tell us just how much, where, when and how this is happening.

Remember that there are many invisible industries behind the more visible ones. Restaurants, for example, rely on food suppliers, who in turn rely on farms, who in turn rely on agricultural workers. They use electricity and fuel, hire cleaners, and buy advertising. Mapping these systems can lead you to industries — and data — you might not have thought of.

Then there are the parts of society that are invisible in other ways. The black economy, for example, is also affected: drug running has been interrupted, prices are affected and drug dealers are adapting, while sex workers are putting themselves at risk to maintain an income. More broadly, those reliant on cash-in-hand work are unable to be furloughed. And there is the unpaid work of care and domestic labour.

As physical movement is replaced with virtual movement, we leave data trails in different ways.

Changing rules, changing behaviour

As the rules change, so human behaviour changes too. Curious anecdotes — such as the driver doing 150mph on a motorway — can be the spur to find out just how unusual that behaviour is. You can ask the relevant bodies (police forces in this case) for data about that behaviour right now, but you can also make a note to dig into data on broader categories of incidents (e.g. traffic offences) when it is publicly released. And in the meantime, you can look for historical data to put that story into context.

Questions such as “What do people do when they aren’t allowed to get a haircut?” can lead you to other stories, too. Google Trends can be particularly useful for some of these — not just how many more people are searching in your area for things like “How to dye hair” but related breakout queries like “How to get hair dye out of clothes”.
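Google Trends queries like these can also be pulled programmatically. One common route is pytrends, an unofficial third-party wrapper around Google Trends; because it tracks an undocumented interface, it can break without notice, so treat this as a sketch:

```python
# pip install pytrends
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-GB", tz=0)
pytrends.build_payload(["how to dye hair"], timeframe="today 3-m", geo="GB")

# Search interest over the last three months, scaled 0-100 by Google.
print(pytrends.interest_over_time().tail())

# Related queries, including "rising" terms with sudden breakout growth.
related = pytrends.related_queries()
print(related["how to dye hair"]["rising"])
```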

If you have access to social media monitoring services like Dataminr or Crowdtangle — or are willing to scrape a sample of updates from social media sites — then you can monitor changing patterns in what people talk about, too. Brandwatch, for example, analysed social media chatter and found that “a peak in news mentions of stockpiling came just before a spike in social mentions of … things being out of stock.”

The tricky thing with scraping chatter, however, is distinguishing between talk about something, and talk that reflects actual behaviour. Brandwatch’s analysis of chatter, then, is complemented by scraping of online retailers which shows when “out of stock” listings peaked.

As physical movement is replaced with virtual movement, we leave data trails in different ways. Many app developers and website owners may be able to provide insights into their users’ behaviour, whether those are apps specifically developed for the current situation, or apps that can show how user behaviour has changed since before a lockdown. The New York Times’s “The Virus Changed the Way We Internet” is one of the most comprehensive examples.

Apps which monitor movements — such as those for fitness or travel — may be obvious candidates, but don’t overlook apps that might shed a light on behaviour such as home working, cooking and the availability of ingredients (e.g. recipe apps), and even crime. Don’t forget to look at apps and online services used by businesses and authorities, too, such as managing logistics or reporting problems, or the companies providing third-party services to app companies: Reuters’s article about changing travel habits, for example, was based on data from a mobile analytics company. The content that we consume on YouTube and other platforms, and the bestsellers on major retailer websites can also be quantified (often through scraping) and analysed. The key is to identify the items or terms that relate to the topic you’re interested in.

It’s important to remember that ‘data’ does not mean ‘spreadsheets’, or even ‘numbers’. Data is any structured information.

Thinking creatively about data

It’s important to remember that ‘data’ does not mean ‘spreadsheets’, or even ‘numbers’. Data is any structured information: it might be web pages or documents that follow a template, or social media updates that can be quantified in some way, or events. It also does not mean ‘statistics’. Statistics – from the same root as the word ‘state’ – are often provided by public authorities. But we can also source data from private organisations, too. Understanding this — and being able to spot structured information and the opportunities that it throws up — is central to seeing story leads and ideas you might otherwise miss.

Sometimes it means creating structured data yourself through compilation and classification. See, for example, how the BBC turn the times and dates of events into charts that allow the reader to compare how different countries reacted to the pandemic.

It’s also useful to think about proxy data: that is, data which acts as a proxy for the thing you are looking for. Air pollution data, for example, can be a proxy for transport activity; energy consumption data can be a proxy for economic activity; waste collection data is a proxy for people moving away or working elsewhere. A spike in people dying at home can raise questions about what that indicates. Social media chatter and search trends are regularly used as proxies for behaviour, too.

Supply and demand are useful factors to consider when it comes to both changing behaviour and potential proxy indicators: high demand and/or low supply drives prices up, and drives knock-on demand for related products: whether that’s a lack of hairdressers leading to increased demand for hair clippers; or how demand for ventilators has led to a shortage in related drugs and forced some doctors to use “unfamiliar alternative drugs [or those] with greater side effects”.

And if you are collecting data yourself, simply sharing the data can be part of your journalistic output: The LA Times and The New York Times are just two organisations that have shared the data that they are compiling.

Looking back

There is an important exception to the focus on new data: historical data can also be given a new lease of life when it puts what is happening now into context. The Spanish Flu, for example, is not a story that readers are normally interested in reading about — but that history has suddenly become very topical.

And modern pandemics like HIV/AIDS and Ebola also become important in helping us put the current pandemic, and the actions being taken (or not), into context — just remember to avoid 3D charts, and to avoid treating current coronavirus death counts as comparable to the final death counts of previous pandemics.

And it’s not just pandemics themselves. Historical data can also be used to shed a light on questions regarding our near future, like “Will the coronavirus lockdown lead to a baby boom?” or what happens when unemployment rises or pollution falls.

It can also provide context for current changes: this BBC data unit story on GPs’ shift to online and telephone appointments drew on data showing just how rare the practice was before the lockdown.

Reporters can look at news from other countries to identify data leads they can pursue in their own country.

Interactivity as a data angle

Another way you can use data to provide a fresh angle on a story is by using it for interactivity. The Washington Post’s coronavirus simulator, for example, has broken traffic records not because it includes new data, but because they have used data to drive an interactivity which is innovative, different — and useful.

Reuters’s “Breaking the wave” and the Economist’s “Tracking COVID-19 excess deaths across countries” take existing data and create a new way of looking at it.

It may be, then, that your fresh angle on coronavirus is simply a new way of presenting the information or helping users to engage with it. And new initiatives — from lifting lockdown to antibody testing to vaccines — present new opportunities for explanations which allow users to explore the complexities of factors such as false positives, the reproduction number and viral load.

Interactivity is just one way that you can set your own reporting apart from others; clear visualisation, strong case studies and expert interviewees are just some of the others, so be prepared to adapt your reporting as you inevitably find others publishing stories in similar areas.

Looking ahead, planning ahead

One of the notable features of the coronavirus story is that it is at different stages in different countries. Countries where the first cases of the pandemic appeared are weeks ahead of countries where it appeared later, and different countries are taking different steps at different times.

Reporters can look at news from other countries to identify data leads they can pursue in their own country. When other countries began to release their prisoners, for example, it prompted journalists to dig into data on their own prison populations: how many might be eligible for early release, or how many belonged to vulnerable groups.

As well as looking ahead we can also plan ahead: most data about the lockdown period itself will only emerge months after it comes to an end — potentially, even, as new lockdowns are announced. So while reading anecdotal reports and single data points about crime, transport, business and welfare, bear in mind that full, comparable data on those topics may already be scheduled for publication in the coming months — and journalists should be planning for it.

When that data is published, expect to see a number of "new data reveals" stories looking back at the impact the lockdown had on each aspect of life the data relates to. You can prepare for this by looking at previous releases, compiling historical data and understanding what you will need to do to analyse the new data when it is released. That will help you turn around those stories more quickly. If you can code, you can also write scripts to perform the data analysis as soon as it is released, as in the sketch below. And if there was ever a good time to learn to code, well, that time is now.
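Here is a minimal sketch of that idea: the analysis is written in advance, and the script simply waits for the release to appear. The URL and column names are hypothetical stand-ins for whatever dataset you are expecting:

```python
import time
import pandas as pd

# Hypothetical location where the agency will publish the release.
DATA_URL = "https://example.gov/releases/lockdown-crime-2020.csv"

def analyse():
    df = pd.read_csv(DATA_URL)
    # Pre-written analysis, ready the moment the file exists
    # (column names are placeholders for the real release).
    print(df.groupby("offence_type")["count"].sum().sort_values())

while True:
    try:
        analyse()
        break  # the release is out and the numbers are crunched
    except Exception:
        time.sleep(3600)  # not published yet -- try again in an hour
```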

For more on COVID-19 reporting, check out:

]]>
Simulating a pandemic https://datajournalism.com/read/longreads/simulating-a-pandemic Mon, 30 Mar 2020 12:37:00 +0200 Tara Kelly https://datajournalism.com/read/longreads/simulating-a-pandemic When Harry Stevens joined the Washington Post as a graphics reporter in September 2019, he never imagined a story he’d publish six months later would become one of the most viewed articles ever on the newspaper’s website. The interactive piece he visualised circulated around the world, showing how a disease could spread through a number of different scenarios, including taking up social distancing.

His motivation for the piece was simple. “Because social distancing was a relatively new phrase for most people, I wanted to show how a disease like COVID-19 spreads,” says Harry. The outcome showed social distancing was the most effective way to flatten the epidemiological curve of a disease, even more so than China’s imposed quarantine.

After former U.S. President Barack Obama tweeted the visualisation out to his millions of followers, it wasn’t long before Harry’s coronavirus simulator caught the attention of public figures throughout the world. “I saw the Venezuelan dictator Nicolas Maduro sharing it on state television,” says Harry.

Even celebrities like Shakira shared a video on Instagram and Twitter referencing the simulation while asking her fans to stay home. The message? Practising social distancing could have a major impact on slowing the spread of the virus by flattening the epidemiological curve. The graphic explained what public officials couldn’t with words alone. In a bid to democratise information during this pandemic, The Washington Post decided to lift its paywall for certain COVID-19 content. Fortuitously, Harry’s piece was one of them. This, no doubt, contributed to the 27,000 likes and 96,000 shares of the article on The Washington Post’s Facebook page.

Gone global

While Harry says he received both praise and accolades from mathematicians and scientists, readers also reached out to tell him how his piece brought a sense of hope to a woefully uncertain situation. “There’s definitely been an emotional response to this piece,” says Harry. “This is a very anxious time for a lot of people. But when you see that you can change the outcome of this by modifying your own behaviour, it gives you a sense of control.”

Soon readers reached out wondering if it could be translated into other languages. “We had a lot of people saying, ‘I want to share this with my parents, but they don't speak English. Can I translate it into Romanian or whatever language they speak?’,” he says. The story is now available in 13 different languages thanks to readers volunteering to translate it themselves, followed by the newspaper proofing it.

The kernel of an idea

Like many data stories, the idea for Harry’s coronavirus simulator came to him at a pitch meeting while brainstorming with fellow reporters. With a background in frontend design and web development, Harry remembered JavaScript code on network simulation that he’d developed a year earlier and wondered how it might apply to COVID-19. “I had the code sitting around and I showed it to the team,” he says. “I suggested we might repurpose it to simulate how things spread through a network and how social distancing works.”

The simulation was intended to show how networks interact and the exponential nature of their growth, not to forecast the disease.

After a nod from his editor, he created a number of prototypes. As he began to tinker with the design of balls bouncing around the screen, he knew the data for the piece would determine everything. “At first I wanted to use real-life data from the COVID-19 and simulate the actual virus,” he says. But, after a conversation with Lauren Gardner, an associate professor in the Department of Civil and Systems Engineering at Johns Hopkins Whiting School of Engineering, he realised it would be impossible to accurately represent COVID-19’s spread in real life. As a professional forecaster of outbreak trajectories, she explained the process of building simulation models of COVID-19 cases and deaths: it required a team of PhDs to run computationally intensive mathematical models on supercomputers for hours and hours. And even then, she warned him, much uncertainty remained in the results.

While he received some criticism from readers about the simulation not showing how COVID-19 would unfold, he explained that doing so would be impossible. Others wondered why he didn't have some of the balls (representing people) die off. But the simulation was intended to show how networks interact and the exponential nature of their growth, not to forecast the disease. “The point is that there is no way that I could simulate COVID-19 in the real world. That's why I made a fake disease and made it clear in the piece that it's a fake disease called simulitis,” says Harry.

To explain the phenomenon of how a similar virus to COVID-19 could spread exponentially, with or without social distancing or quarantine, he decided to create coloured balls bouncing against each other showing sick, healthy, and recovered individuals. He used randomised data for his fake disease simulitis, not COVID-19 data. The prototype was adjusted accordingly based on the made-up disease’s randomised data. “I think that even though it was so simple, it still mimicked the growth curve that we see in the real data,” says Harry.

The visualisation showed the spread of simulitis over a number of different scenarios, with different coloured dots representing healthy, sick, and recovered people bouncing around the screen. The outcome of the spread was shown through the following scenarios: a) no quarantine or social distancing measures b) an attempted quarantine c) moderate social distancing d) extensive social distancing. Social distancing proved to be the most effective measure, even over a forced quarantine like the one attempted in China.

But the one part of the story that did use actual real-life COVID-19 data was the exponential curve of confirmed cases in the United States. This was necessary to set the scene and show the steep growth curve of the disease in the country. Using the data set from Johns Hopkins University, the epidemiological curve showed confirmed cases from the first detected case in the country on 22 January 2020 to 13 March 2020, the day before he published the story.

For his graph, he chose the COVID-19 data set from Johns Hopkins University due to its accuracy in data collection. “In the US, they've been very carefully collecting data. Because there's no central repository of all cases and deaths in the United States. You can't just go to a CDC website and get that,” explains Harry. “Johns Hopkins has been contacting all of the states, counties and collecting their data and putting it in a central database.”

Behind the design

Amongst the hundreds of messages from readers, a vast number of requests came through asking how he technically designed the story. While The Washington Post doesn’t share its code on GitHub, the original experimental code that inspired his story is published online here. “I repurposed a lot of that code, so it shouldn't be too hard for someone who knows JavaScript to also spin up a simulation from it,” says Harry.

For the graphic at the top of the story, he used D3.js, a JavaScript library for manipulating documents based on data. As for developing the simulations, he used Geometric.js, a library designed for computational geometry. Instead of using SVG for the simulations, which could lag the loading of the webpage, he opted for the Canvas API, which delivered a smoother experience for the user.

For the exponential curve showing real data from COVID-19, he wrote a web scraper that pulled the data set in from Johns Hopkins University’s GitHub page. Every couple of days while designing the interactive, he would update the scraper to see if the data from the curve had flattened. Instead, Harry found the curve grew sharper and sharper by the day.
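A scraper along those lines needs very little code, since Johns Hopkins published the data as CSV files on GitHub. The path below was correct while the repository was live; it has since been archived, so check the CSSEGISandData/COVID-19 repository for the current layout:

```python
import pandas as pd

# Johns Hopkins CSSE confirmed-cases time series, straight from GitHub.
URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_time_series/"
    "time_series_covid19_confirmed_global.csv"
)

df = pd.read_csv(URL)

# One row per province/state; the first four columns are metadata,
# the rest are dates. Sum to a single national series for the US.
us = df[df["Country/Region"] == "US"].iloc[:, 4:].sum()

# Cumulative totals -> daily new cases: is the curve flattening?
print(us.diff().tail(14))
```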

Data journalism takes centre stage

By persuading people to change their behaviour during a global health crisis, the article has served as an exemplary case study for the wider data journalism community. Factual information has never been more in demand, but now there’s an even greater need for making data meaningful to audiences. And that applies to coverage about and beyond COVID-19, too.

Google News Lab’s data editor Simon Rogers, a leading voice in data journalism, believes the impact of the story is clear: “Data journalism has had these moments that have made it more important. How many people now are looking at epidemiological curves and understanding them now because of data journalists?”

My hope is that one of our biggest learnings is to continue to focus on iterating on how we illustrate uncertainty better

Simon isn’t alone. Amanda Makulec, a senior data visualisation lead at Excella and the Data Visualisation Society’s operations director, believes Harry’s story is a shining example of responsible data storytelling. “It belongs in a top-five hall of fame for being an illustration of data that helped inspire an entire country, and really the world, to make certain choices that are really hard to make around limiting your activity,” she says.

She argues that the power of data visualisation lies in helping people understand complex concepts. In a recent Fast Company article she penned, she warns readers to be wary of misleading charts, graphs, and maps, while also offering advice on how audiences can interpret data visualisations as intended.

She also highlights the need for people to create responsible designs that add value and make an impact: “How are people going to see it and recognise it in these uncertain times? You don't want to have put something out there that can mislead somebody. And right now, misleading information isn't what we need in the public sphere.”

As for what journalists and designers can learn from this pandemic, her answer is simple: “My hope is that one of our biggest learnings is to continue to focus on iterating on how we illustrate uncertainty better.”

]]>
Data, diets and disease https://datajournalism.com/read/longreads/data-diets-and-disease Wed, 26 Feb 2020 15:00:00 +0100 Aneri Pattani https://datajournalism.com/read/longreads/data-diets-and-disease Drink less coffee to be healthier, suggested a recent article on MSN, citing the fact that coffee can increase anxiety. Two days later, a story on Yahoo recommended the exact opposite. People should drink more coffee, it said, because the beverage may reduce the risk of type 2 diabetes and high blood pressure.

Similarly, it’s been reported that chocolate can improve sleep and prevent depression, but also may lead to an earlier death. A glass of red wine can help you live longer, according to some articles, and increase your risk of cancer, according to others.

These stories, all of which cite scientific studies, embody a common criticism of health journalism: it misrepresents scientific findings, resulting in a nonsensical series of articles claiming some food is healthy one day, and harmful the next. The whiplash in reporting on scientific research is so pervasive that it inspired an entire episode of ‘Last Week Tonight’ with John Oliver.

But the real harm of this type of health reporting is that it can erode public trust in science. It can lead people to believe there are no true findings at all, and fuel movements like climate change denial and anti-vaccination campaigns.

However, if journalists manage to navigate around the minefields of hyperbolic claims, complex medical trial data, and potential conflicts of interest, health reporting presents a real opportunity to provide a public service. Because health affects us all, thoughtful and responsible health stories can empower communities to make important social and policy choices, as well as advocate for their own health.

To deliver such stories, journalists need to learn how to interpret studies responsibly, investigate potential bias and ethical concerns, and provide broader context around new findings. Although it may seem intimidating, even a basic understanding of these concepts can go a long way toward public health journalism that serves the public good.

All studies are not created equal

Academic research of all kinds is often referred to as a study. “New study shows soybean oil can cause genetic changes in the brain,” a headline might say. Or “study finds this medication can reduce the risk of bowel cancer.”

But too often the details of those studies -- what was done, how significant the findings are, and what limitations the study has -- are left out. Like the fact that the study on soybean oil and the study on bowel cancer medication were both done in mice.

That’s a key difference, said James Heathers, a data scientist at Northeastern University in Boston. Although mice studies are a valuable part of medical research, their findings are not always transferable to humans, he told STAT News.

That’s why he’s created a Twitter account called @justsaysinmice, which retweets articles about scientific research, often with just two words added on: IN MICE. The goal is to help the public, and journalists, recognise the importance of that difference.

As a health journalist, one of the most crucial things to do is understand different types of studies and the benefits and drawbacks of each. Here’s a crash course:

Test tube and animal research is highly preliminary and should be treated as such. Many news publications won’t even report on these types of studies because it’s unclear what human impact, if any, they will have.

Ideas, editorials, and opinions from knowledgeable researchers can be interesting, but are not yet at the threshold of scientific evidence.

A case report is the publication of a single case in a journal or other academic forum, typically because the case is unique and significant.

While such cases can be fascinating, journalists should be careful not to generalise their findings.

For instance, a recently published case report on scurvy diagnosed in an otherwise healthy 3-year-old boy should not be reported as “new study shows all children may be at risk for scurvy.”

In reality, all that’s been found is one instance in which a child without typical risk factors, such as malnutrition or Crohn’s disease, had scurvy. It could be an anomaly rather than a generally applicable phenomenon.

A case control study is a study that compares patients who have a disease (cases) with patients who do not (controls).

It is not an active intervention, but rather a retroactive observation of what was different between the controls and the cases.

Those differences may or may not be meaningful, and this type of study cannot indicate if the differences caused the disease or not.

A cohort study involves following one or more samples of subjects (called cohorts) over time.

Just like a case control study, there is no intervention here. Instead, it is a long-term observational study.

The Framingham Heart Study, which began in 1948 in Framingham, Mass., is one of the most well-known cohort studies. By observing the participants over decades, seeing who developed heart disease and what their lifestyles were like relative to those who did not, researchers were able to identify several risk factors for heart disease.

The findings of randomised control trials are considered some of the most trustworthy. Participants are randomly assigned to either get an intervention (i.e. receive a drug, start a new diet, get a vaccine) or not. The results of the different participants are then compared.

For even greater confidence, some randomised control trials are double-blinded, meaning neither the patients nor the researchers know who has received the intervention and who has gotten the placebo. This avoids bias in the reporting and interpreting of the data.

The smaller the sample size of a study, the more reason to be cautious of its findings.

Meta-analyses and systematic reviews are studies where researchers amass data from a number of previously published studies on the same issue and reanalyse the data together. These are considered some of the best forms of evidence since they’re based on multiple experiments. Yet, they rarely generate news.

In addition to these different study methods, journalists should also note that medical studies done on a new drug or therapy are often referred to by a certain phase.

° Phase I -- when a drug is being tested for safety, typically on a small group of human volunteers

° Phase II -- when a drug is being tested for efficacy on a slightly larger group of people

° Phase III -- large-scale testing, often involving randomised and blinded study designs on large populations

° Phase IV -- after a drug has been approved for consumer use by the FDA; often testing its long-term efficacy or comparing it to other drugs on the market

There are no hard and fast rules about which types of studies should or should not be covered, but understanding the differences can help journalists make educated decisions and convey the significance of any particular study appropriately.

Understanding research statistics

After learning what type of design a study is using, reporters have to be able to interpret the findings. While it can be easy to skim past unfamiliar terms or seemingly complex numbers in an academic paper, this data can be a critical tool for evaluating the study’s merit.

While health reporters don’t need to be statisticians, knowing three basic statistics can be helpful:

1. P-value is a statistic meant to indicate how likely it is that a finding at least as strong as the one observed would occur by chance alone, if the intervention being tested had no real effect.

Generally, the cutoff for p-values is 0.05. A p-value lower than that suggests confidence in the findings, as a result that strong would arise by chance less than 1 in 20 times. The lower the p-value, the greater confidence one can have in the findings. Conversely, a higher p-value means the findings are less statistically significant and could be coincidental.

However, even a low p-value does not mean the findings should be taken at face value. P-values can be manipulated -- often referred to as p-hacking -- or they can simply be misleading. FiveThirtyEight illustrated this point by conducting a survey in which they gathered demographic data and asked people about their diet. The reporters then went on to run regressions on that data and found statistically significant relationships (meaning p < 0.05) between eating raw tomatoes and being Jewish, and eating cabbage and having an innie belly button.

FiveThirtyEight graphic

Obviously one does not cause the other in these cases, no matter what the p-value -- which is why it’s crucial to look at other study statistics too.
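The FiveThirtyEight result is easy to reproduce in miniature: test enough hypotheses on pure noise and some will clear the p < 0.05 bar by chance. A small simulation, assuming numpy and scipy are installed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 100 "studies" where the true effect is zero by construction:
# both groups are drawn from the same distribution.
false_positives = 0
for _ in range(100):
    a = rng.normal(size=50)
    b = rng.normal(size=50)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

# Roughly 5 of the 100 tests come out "significant" by chance alone.
print(false_positives)
```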

2. Confidence interval is a range of values within which researchers are fairly certain the true value lies.

Many medical studies may have a statement like the following:

“This study reports this result as a relative risk reduction for the development of prostate cancer of 22.8% (95% CI: 15.2-29.8; P < 0.001) for patients taking dutasteride compared to patients taking placebo.”

That means researchers can be 95 percent confident that the true relative risk of developing prostate cancer is between 15.2% and 29.8% lower for patients taking the drug versus those taking a placebo.

The smaller the range of the interval, the more confidence one can have in the findings (e.g. 2.46-5.63 is better than 21.93-132.3).

Also, the lower and upper values of the interval should either both be positive or both be negative. If the range contains zero -- for example, -1.03 to 3.02 -- the finding is not statistically significant. Taking the previous example, if the confidence interval for the relative risk reduction with the drug was -1% to 4%, that would mean the drug might reduce the risk, have no effect at all, or even increase it. Thus, the finding is not significant.
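The zero-crossing check is mechanical once you have the interval. A sketch with invented trial data, using a normal-approximation interval:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented outcomes: change in blood pressure, drug vs placebo.
drug = rng.normal(-4.0, 8.0, 120)
placebo = rng.normal(0.0, 8.0, 120)

diff = drug.mean() - placebo.mean()
se = np.sqrt(drug.var(ddof=1) / len(drug) + placebo.var(ddof=1) / len(placebo))

# Normal-approximation 95% confidence interval for the difference.
low, high = diff - 1.96 * se, diff + 1.96 * se
print(f"difference: {diff:.2f}, 95% CI: ({low:.2f}, {high:.2f})")

# Same sign at both ends -> significant; straddling zero -> not.
print("significant" if (low > 0) == (high > 0) else "not significant")
```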

3. Sample size -- the number of participants or other type of population in which the study was conducted, often written as n.

The smaller the sample size of a study, the more reason to be cautious of its findings. If a study was only done in white men over the age of 50, the results may not apply to black women in their 20s. Reporters should clearly state in their articles what sample was used and what that means about the generalisability of the results.

Uncovering conflicts of interest and other ethical concerns

By now, it’s been widely documented that the tobacco industry funded research to undermine the harms of smoking and that the fossil fuel industry hired scientists to discredit climate change. Yet for years the findings from these industry-backed studies were reported as fact, without caveats.

Luckily, today there are many tools to help reporters uncover funding sources and potential conflicts of interest that can help avoid repeating the mistakes of the past.

It all starts with the basics: read to the bottom of the study. Often at the end, researchers will list their affiliations with trade organisations or companies, and the source of their funding. Of course, they may not be completely honest, but it’s a good place to start.

Credit JAMA Netw Open. 2020;3(1):e1919940. doi:10.1001/jamanetworkopen.2019.19940

Another way to go is to ask the researchers outright: “How is your study funded? Do you have any conflicts of interest or associations that might influence the way the public perceives this study?” Again, they may not respond honestly, but it never hurts to get them on the record.

The best way to uncover ethical concerns is to do some independent sleuthing.

Start by running a Google search of the study authors. It’s surprising how often something so simple will produce interesting information -- a pharmaceutical company’s press release about the author coming to work for them or a conference presentation where the author promoted a particular drug.

Then check the researcher’s LinkedIn profile. Are they consulting for companies in addition to their academic work? Did they spend five years in an industry-funded think tank?

If that doesn’t yield much, there are several databases designed to shed light on this exact issue.

° CMS Open Payments -- This U.S. federal database includes information on industry payments to physicians and teaching hospitals. You can search by doctor, hospital, or company. The data is free to download.

° ProPublica’s Dollars for Docs -- This tool allows reporters to search for payments from pharmaceutical and medical device companies to a variety of doctors and U.S. teaching hospitals for everything from promotional talks to research and consulting. ProPublica updates the database every so often. Using the search tool is free, but downloading the dataset costs $250 for journalists.

° ProPublica’s Dollars for Profs -- Similar to Dollars for Docs, but for university researchers. The database allows you to search records from multiple state universities and the National Institutes of Health for outside income and conflicts of interest of professors, researchers, and staff. It’s a limited sample of universities, but a good place to start. The search tool is free, but downloading the entire dataset can cost anywhere from $200 to $3000 depending on the use.

° Kaiser Health News’ Prescription for Power -- If the study involves or is promoted by a patient advocacy group, it’s a good idea to check if the group receives funding from pharmaceutical companies. Although this database hasn’t been updated in a few years, it remains a useful tool.

Keep in mind that studies with negative results often don’t get published at all, so that can skew the results that surface.

Financial conflicts are not the only ethical concerns that can occur in the research world. Sometimes study authors can manipulate their data or interpret results in a biased manner to increase their apparent significance. For example, researchers might initially set out to test the effect of a drug on blood pressure. But when the results come in, they realise the drug didn’t affect blood pressure the way they hoped. Yet it did lower cholesterol. So then the researchers might change the measured outcome to cholesterol in order to show positive results. This practice -- known as outcome switching -- is a flawed one, yet it can be difficult to detect unless you’re part of the research process. Fortunately, there’s a tool to help with that too.

ClinicalTrials.gov collects data on publicly and privately funded clinical trials from more than 200 countries. It includes information on the study design, participants, outcomes to be measured, and results. Most importantly, it is updated throughout the lifetime of a study, and historical versions of the information get saved on an archival site.

Reporters simply need to get the NCT Number, a unique identifier for every clinical trial which can be found at the top of the study page on ClinicalTrials.gov, and enter it on the archival site. The tool will then show all previous versions of the study details. It can even highlight changes between any two versions (red for older versions and green for newer versions).

Credit ClinicalTrials.gov

If the study design changed from the original proposal to the final paper, that doesn’t necessarily mean there is unethical conduct. But it’s important to ask the study authors about the reason for the adjustments.
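
For reporters comfortable with a little code, study records can also be pulled programmatically. The sketch below assumes ClinicalTrials.gov's public v2 REST API and its JSON field layout; it fetches one record by NCT number (a placeholder here) and prints the registered primary outcomes -- the field to compare against the published paper:

    # Sketch: fetch a trial record and list its registered primary outcomes.
    # Assumes the ClinicalTrials.gov v2 API and JSON schema; the NCT number
    # below is a placeholder to be replaced with a real trial ID.
    import requests

    nct_id = "NCT00000000"  # placeholder
    url = f"https://clinicaltrials.gov/api/v2/studies/{nct_id}"
    study = requests.get(url, timeout=30).json()

    outcomes = (study.get("protocolSection", {})
                     .get("outcomesModule", {})
                     .get("primaryOutcomes", []))
    for outcome in outcomes:
        print(outcome.get("measure"))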

Another website to keep an eye on is Retraction Watch, which tracks papers that get retracted due to errors or misconduct. These often don’t get any media attention, but they can reveal important flaws in the scientific system. It’s worth monitoring the website for highly publicised studies that may be retracted, as well as to learn about the types of errors that are occurring and some of the researchers who are making them. That doesn’t mean you should never interview researchers who’ve had a study retracted. After all, people in all fields make mistakes sometimes. But it’s a good reminder to remain cautious.

Questions to ask the study author(s)

After reading through a study and investigating it as best as possible, the next step is to present questions directly to the researchers. They’ve spent the most time with the material and can say the most about it. Many researchers are eager to share their work with the public, and these can be very fruitful conversations.

But it’s key to go in prepared with the right questions. Here are a few -- by no means a complete list, but a place to start:

° What was the original study question? If it changed, when and why?

° Were any types of participants excluded from the study? (e.g. if anyone with heart disease was removed from the study of a high blood pressure medication, then the findings are limited)

° Did people drop out of the study midway through? If so, how many and why?

° What was the intervention being compared with? Was there a control group? Was it being compared to another drug? If so, is that drug representative of what’s currently on the market or is it an older version?

° What are the benefits of the intervention? And what are its potential harms/adverse effects?

° How easily available is the intervention? What does it cost?

° What are the limitations of the study?

° How generalisable are your findings?

° How do your findings fit in with the existing literature in this area?

Providing context for the reporter and the reader

As it’s likely clear by now, health reporting is complex and nuanced. That means one of the most crucial things reporters can do for readers is to provide context. Yet journalists are not scientists, so first they must seek out the context themselves. Here are a few places to get started with that:

Find a biostatistician to be a regular, go-to source who can comment on studies. While knowing basic research statistics like p-value and confidence interval is helpful, a biostatistician can get into the weeds and alert journalists to questions they need to ask or concerning study elements they may have missed. Most local universities have statisticians who can act as expert sources.

Read meta-analyses and systematic reviews on the subject. As mentioned earlier, these are papers that take data from several different studies and reanalyse it together, giving a much larger sample size and more confidence in the findings. Meta-analyses and systematic reviews can provide perspective on where the public health field currently stands on any given subject, and whether the new study makes sense in that realm or if it is too far off base. The Cochrane Library is a good resource for systematic reviews, as is a search with the keywords “systematic review” on Google Scholar or PubMed.

Keep in mind that studies with negative results often don’t get published at all, so that can skew the results that surface. To that end, sometimes it’s helpful to talk to others in the field.

Have an outside researcher comment on studies as a regular part of the reporting process. As scientists themselves, they’ll be able to spot important findings as well as study flaws better than most reporters. They can be found through local universities, professional associations (e.g. the American College of Cardiology), or the other papers cited in the study of interest.

Public health reporting can be incredibly complex, especially for journalists without a background in medical science. But for diligent and compassionate reporters, it is also an opportunity to provide a public service. They simply need to master the tools to cut through the inflated claims and misleading rhetoric. That way, it is not pharmaceutical companies or individual scientists who needlessly gain power from the reporting, but readers who walk away empowered to make better-informed health decisions.

Bringing the power of data to deadline stories https://datajournalism.com/read/longreads/how-to-bring-the-power-of-data Tue, 11 Feb 2020 11:41:00 +0100 MaryJo Webster https://datajournalism.com/read/longreads/how-to-bring-the-power-of-data Reporters give me a sceptical look when I tell them that it is possible to apply data analysis to stories that don’t require months of their time.

Too many reporters and editors equate data journalism with big, investigative projects, which is not surprising, since that is where the field is most visible and most heavily used.

But there are huge benefits to be gained from applying a little data to those quicker-turnaround stories, too, and it can be done on almost any beat or topic area while still meeting tough deadlines and keeping copy flowing to your editors.

Getting started with applying data techniques to your day-to-day workflows requires a little upfront investment -- including at least some basic data training -- but you certainly don’t need to be a veteran data journalist to pull this off for short-term enterprise stories, follow-ups to breaking news stories, and maybe even a daily story. In this Long Read, I’ll walk you through some ways to build data analysis into your everyday work and show you how some of my Star Tribune colleagues are doing so in practice.

Following the money is always a good data-driven option, regardless of your beat.

Mindset

I’m a big believer in the power of a “data state of mind” and being able to navigate a spreadsheet. Thinking about data as a source, instead of as something special or different, paired with some basic Excel training, will get you off the ground.

Let’s look at an example.

Star Tribune education reporter Faiza Mahamud wanted to do a story about the lack of diversity among Minnesota’s elementary and secondary school teachers. A state report showed that only 5% of teachers were people of colour, compared to 34% of their students.

Those two numbers were enough to indicate there was a story worth pursuing, but she had other questions. What schools had the greatest gap between the diversity of teachers and diversity of students? Were there any that had made progress in closing that gap in recent years? And what schools should she visit for interviews and photos? She could have made calls to education leaders in the state to try to find answers. But she had just learned how to use spreadsheets and she suspected data could help.

Two fairly basic and easily available datasets from the state teacher licensing board and the state department of education, identifying the racial breakdowns of teachers and students, gave her the answers to all of those questions and helped her publish a story shining a light on the lack of teacher diversity.
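
The analysis behind a story like that can be remarkably small. Here is a sketch of the idea in Python's pandas library -- the file and column names are hypothetical stand-ins, not the actual Minnesota datasets:

    # Sketch: compute the teacher-student diversity gap for each school.
    # File and column names are hypothetical stand-ins for the state data.
    import pandas as pd

    teachers = pd.read_csv("teachers_by_school.csv")  # school_id, pct_teachers_of_colour
    students = pd.read_csv("students_by_school.csv")  # school_id, pct_students_of_colour

    schools = teachers.merge(students, on="school_id")
    schools["gap"] = (schools["pct_students_of_colour"]
                      - schools["pct_teachers_of_colour"])

    # Schools with the widest gaps are candidates for visits and interviews.
    print(schools.sort_values("gap", ascending=False).head(10))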

Finding the targets

Certainly, not every story is going to need data, nor will there be time to use data on everything. So the first hurdle is picking your targets. Enterprise stories, like the one Faiza produced, are a natural target. But what if you don’t have an idea in mind?

I frequently sit down with beat reporters in my newsroom and ask them a couple of simple questions: “What topics keep coming up over and over on your beat?” and “What are people talking about that you’d like to prove is true or not true?” After they answer my questions, I tell them: “There’s your story.”

Emma Nelson, a city government reporter, had been hearing that the city of St. Paul had been condemning a lot of properties simply because owners had failed to pay water bills. She wondered if it was true.

The city didn’t track this data, but she found out that they could provide her with a spreadsheet with dates and addresses where condemnation letters had been sent. However, she had to read hundreds of PDF copies of the actual letters to find the reason for condemnation, and manually add that to the spreadsheet. She gradually worked on this story, in between other assignments, over a few months. Once finished, the analysis was pretty simple.

Editors want you to meet with those important human sources, so why not take some time to ‘have a coffee’ with your data sources, too?

Her story prompted the city to approve a policy giving city residents more time to appeal water bills and shut-offs, and to start tracking this themselves.

The other place to look for a story is when you’re spending long stretches of time covering one issue. Editors are demanding that you stay on top of the breaking news, but also generate enterprise stories. This is always a good opportunity to see if data can help get you a better story.

Mila Koumpilova found herself in this situation as the higher education reporter, covering the University of Minnesota’s effort to replace its retiring president. This year-long process yielded story after story, but she saw a data opportunity by following the money. The university had hired a search firm to find candidates for the job. She wondered how much they were being paid. And how much did the university spend for search firms to help fill other jobs? With the data, she was able to write an eye-opening story about how extensively the university used search firms. Side note: following the money is always a good data-driven option, regardless of your beat.

Meet your data

You will also be more likely to find stories if you know what data sources are available. When you start a new beat, usually your first task is to find out who the important people are -- the ones you need to be talking to on a regular basis. I tell reporters to expand that process and also find the important datasets you should be “talking to.”

I suggest they get a copy of the important datasets to not only learn the process of obtaining it, but also to tinker around with it in spare moments. Editors want you to meet with those important human sources, so why not take some time to ‘have a coffee’ with your data sources, too?

Being familiar with a dataset will make it far more likely that you can put it to use in the future. I’ve had many situations where a reporter or editor came to me to discuss an upcoming story and because of past experience with a dataset, I could quickly envision what would be possible.

Sometimes the data you want just isn’t available. That’s when you can really get a special story -- by building it yourself.

That also makes it more possible to add data analysis into a daily or very quick-turn story. That’s what sports reporter Ben Goessling did about three weeks into the 2019 American football season.

He knew of a website where he could download numerous years of team statistics that included a key metric: how often teams used a run play versus a passing play. He wanted to put the Minnesota Vikings’ 2019 season performance into historical context. More than 60% of the Vikings’ plays had relied on running, which even a moderately knowledgeable Vikings fan knew was atypical.

Ben downloaded about 10 years’ worth of team statistics, then added a column identifying how each team fared in the playoffs that year. His story featured a quick analysis that showed the most successful teams had a more balanced offence, running the ball 40 to 45% of the time. I should note that he did all this in less than a week, while also writing daily stories and doing a podcast. If you read the story, you’ll see that his analysis resulted in a couple of sentences and a graphic, but it added support to the rest of his reporting.
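
A bare-bones version of that analysis fits in a few lines of pandas. The column names here are hypothetical; the real figures came from a sports statistics site:

    # Sketch: compare run-play share between playoff and non-playoff teams.
    # Column names are hypothetical stand-ins for the downloaded statistics.
    import pandas as pd

    seasons = pd.read_csv("team_seasons.csv")  # run_plays, pass_plays, made_playoffs
    seasons["run_share"] = (100 * seasons["run_plays"]
                            / (seasons["run_plays"] + seasons["pass_plays"]))

    # Average run share, split by whether the team reached the playoffs.
    print(seasons.groupby("made_playoffs")["run_share"].mean().round(1))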

Recurring datasets

When you go hunting for datasets, also find out when they are regularly updated. You’ll usually find some that might be newsworthy when the new data is released.

In those cases, you can count on that data to yield annual stories and sometimes even produce unexpected stories you hadn’t thought about.

In these situations, the key is planning ahead. It helps to have a rough idea of what story you think the data can help you tell, or at least have some questions you are hoping to answer. You don’t want to dive into a dataset without a game plan.

Business reporter Christopher Snowbeck routinely mines Medicare cost and enrollment data that comes out each year to bolster his reporting on how that business is changing and what senior citizens can expect when they are shopping for plans. Before the data arrives, he already has questions he wants to ask the data based on what he’s hearing from his sources. Often his analysis yields a sentence or two that supplements his traditional reporting, or confirms what he’s heard from others.

Education reporter Erin Golden mines data on graduation rates and school test scores every year to keep readers up to speed with student achievement. There are always the basic stories around whether rates improved from previous years or not. But her familiarity with the data, from using it every year, also helps in other ways.

In 2019, the state education department put out a press release outlining the school test scores that attempted to put a positive spin on results. She and I looked at the data and saw nothing positive about it. The data showed flat or even declining test scores almost everywhere, and among every group.

Without a prior understanding of the data, another reporter might have written a very different story based on that press release. Sometimes these data dumps bring surprises, too -- findings that aren't in the press releases. From that same dataset, Erin and I found another story: the increasing number of schoolchildren opting out of taking the standardised tests.

Breaking news

Reporters who have general assignment jobs, without a defined topic area, struggle with my recommendation to find datasets pertinent to their beats. But there are a lot of datasets that can be useful in breaking news situations that these reporters should have at the ready. Imagine being that reporter who stuns the editor with an impressive story putting a news event in context just a few days later.

I was in that situation in 2007 when a major bridge collapsed into the Mississippi River in Minneapolis, killing 13 people and injuring dozens of others. I was working at the St. Paul Pioneer Press at the time, and had previously dabbled with the federal bridge inspection database. I knew where to get that data and that it would allow us to look at which other bridges in Minnesota were considered deficient and in need of repairs or replacement. We published that story four days after the bridge collapse.

Look for datasets on vehicle crashes, plane crashes, building fires, workplace deaths, or gas line explosions. At the same time, get to know what agencies do inspections that are designed to prevent these catastrophes and see what’s in their data.

The key is keeping your analysis focused, with a manageable scope.

Get some basic data at the ready so you can answer questions like, “Have there been more murders in our city this year than any year in the past?” Have a good understanding of census demographic data to know when a city or county in your area is on the verge of becoming majority-minority or passing some other important benchmark.

There might also be data opportunities around big events in your area. Is your city hosting a festival or a major sporting or political event? For these, the data opportunities might arise after the event. Is there government spending involved that you could track? Is there data showing how much overtime law enforcement incurred to provide security for the event?

Also think about whether there are any important anniversaries coming up. Ten years after that bridge collapse, I helped a fellow Star Tribune reporter use data to look at whether Minnesota had lived up to its promise of fixing all the other deficient bridges. On the anniversary of a horrific injury to a high school hockey player, sports reporter David La Vaque, with some help from data journalist Alan Palazzolo, analysed penalty data to assess whether rule changes after that injury had changed the game.

Build it yourself

Sometimes the data you want just isn’t available. That’s when you can really get a special story -- by building it yourself.

Many news organisations in the United States, including my own, set out a few years ago to count how many people had been killed in encounters with police. We all discovered that law enforcement agencies weren’t doing a good job tracking this. So we did it ourselves.

My newspaper published our version of that dataset nearly four years ago, and we decided to keep it updated. Each time there is a new death, we add a record. Every few months, we take some time to review news stories from throughout Minnesota and the state’s death certificate database to find any we missed. Now when a new incident happens, our breaking news team has data at the ready to show readers how many similar incidents have happened this year or in this particular law enforcement jurisdiction.

Take this idea of tracking something and think about whether there are other places you could apply it. Are there city council decisions that you want to track in a way that the city doesn’t? Would it be worth tracking hate crimes and documenting more detail about each one than the law enforcement agency does?

A few years ago, sports reporter Marcus Fuller wanted to quantify something he noticed while covering college men’s basketball. He saw that there weren’t very many black head coaches, and he was certain the number had declined over time. Marcus, with some help from an intern, ploughed through college websites and media guides and called athletic directors to build a simple spreadsheet of coaches at each of the top tier teams, going back in time. The data proved his theory and provided the backbone to an important story.

In the above chart, each year represents the year the season began. Schools in the Big 8, which became the Big 12 in 1996, are included under the Big 12.

Keeping it manageable

Now that I’ve planted a ton of ideas in your head, I’m sure you’re wondering, “How do I actually pull it off, in between all my other responsibilities?”

Yes, that is the challenge.

The key is keeping your analysis focused, with a manageable scope. How big that is will be different for each story and dependent on your data skills. If you have someone to help you in your newsroom, you can spread your wings a little more. If you are really new to working with data, or if you don’t have someone to help you, you’ll want to keep it very tight and try to pick datasets that are simple.

A dataset that has nine relational tables, loaded with codes that need to be translated using lookup tables, will probably be too much for a quick-turn story.

Also, think about the quantity of data that you are working with. Do you need 20 years’ worth of data to tell the story you want? Does it need to include every school in the state? Sometimes less data is sufficient, making it less time-consuming, too.

You want to be sure you thoroughly understand your data and that you’ve done the analysis correctly, but without spending too much time fishing for things you don’t need.

Keep in mind that you might not need to do data cleanup or standardisation on every field in the dataset. If it’s a field you aren’t going to use, just leave it alone.

Start with just a few questions that you want to answer, as Faiza Mahamud did for her story on teacher diversity. Try to stick to those questions and not let yourself veer off on tangents.

It might sound like I’m telling you to cut corners, but in data work, there are often things that can be trimmed without sacrificing the quality of the work. You want to be sure you thoroughly understand your data and that you’ve done the analysis correctly, but without spending too much time fishing for things you don’t need.

Here’s where a good editor or colleague can come in handy. This person doesn’t necessarily need to have data skills. Show her your findings as you proceed. Her reaction will help you figure out when you’ve hit gold, or where there might be lingering questions. She will help you see if something might be too good to be true, or where your analysis doesn’t jibe with what you’ve heard from sources. She can also help you steer clear of tangents, and keep you energised.

A data diary is your friend

You will also probably need to pick away at your data analysis slowly, over an extended period of time, while working on other things. Once you have some findings, perhaps your editor will be able to free you up a little to do the rest of the reporting and crank out the story.

Make your life easier by keeping a data diary. This could be a text document or a paper notebook. If you are using a coding language like R or Python, your documentation could be annotated right in with your code.

Your diary will need to include basics like where and when you got the data, the name and contact information of the person/agency who provided it, and what is or isn’t in the data. Then, as you go through, document each step. Make a note about a field you cleaned up. Put down questions that pop into your head.
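
If your analysis lives in code, the diary can live there too. A minimal sketch of what that looks like in Python -- the dataset, steps, and contacts are invented placeholders:

    # DATA DIARY -- condemnation letters (invented placeholder example)
    # Source: city records office, received via records request
    # Contact: records clerk (details in contacts file)
    # Caveat: the 'reason' field was added manually from PDF letters
    import pandas as pd

    letters = pd.read_csv("condemnation_letters.csv")

    # Step 1: standardised the address field -- stripped whitespace and
    # upper-cased so duplicate addresses collapse correctly.
    letters["address"] = letters["address"].str.strip().str.upper()

    # Step 2: counted condemnations by stated reason.
    print(letters["reason"].value_counts())

    # TODO next session: ask the agency why 'reason' is blank in some rows;
    # check the date field for out-of-range years.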

A key to being successful using data for daily or quick-turn enterprise will be having a supportive editor.

At the end of each session working with your data, write yourself a to-do list for the next time. All of these steps will make it easier for you to pop in and out of the data as efficiently as possible. You don’t want to waste a ton of time next week trying to figure out what you did last week.

You also need to carve out time in your calendar to work on your data. I find it helps to have a very specific to-do list that you can turn to when you have a free block of time, even if it’s just 15 minutes. Instead of writing down something vague like “work on my data,” break it into pieces. For example, a good to-do list might include: “call agency source and ask why field X is blank” or “clean up the date field” or “create a Pivot Table looking for X.” Keep each item very simple and limit your tasks to things that can be done in a limited period of time.

Another helpful trick: Use the first 15 minutes of your day for this work, before jumping into something else. Set a timer, if necessary, to remind you to jump back to your daily work.

Working with your editors

A key to being successful using data for daily or quick-turn enterprise will be having a supportive editor. Ideally, this person would be willing and able to let you slide on covering a mundane council meeting or writing a simple daily to give you spare time for something a little more ambitious. At the same time, though, you need to prove to them that you will deliver.

I’ve seen a lot of editors who are reluctant to free up reporters because they’ve been burned in the past by reporters who didn’t deliver. So a simple rule of thumb: Don’t promise more than you can deliver. Start small. Save your big ideas for later, after you’ve proven yourself.

Involve your editor in your story as much as you can. Encourage her to review your data findings. Get her excited about how the data will make for a better story. This is especially important if your editor isn’t data-savvy.

Unexpected benefits

Using data on a regular basis for stories has a lot of perks beyond just the good stories that it helps you produce.

Data can be great at helping you find the people who will bring your trend story to life, or the places you need to go for interviews and photos. We’ve used election and census data to find rural Minnesota communities that had just the right mix of voting trends and demographics that we were looking for. Health reporter Glenn Howatt used data on school vaccination rates to pinpoint which schools he should talk to for his story about places lacking herd immunity for measles.

Even some simple data can help you go beyond the press release that a government agency has put out, giving more context to a daily story.

Getting data helps you dive deeper into your beat or whatever topic you are writing about. You see the details, plus you can see the big picture.

Possibly the best perk, though, is that the human sources on your beat will come to know you as that person who is going to be fair and thorough, who doesn’t settle for one good quote. I’ve seen numerous reporters command more respect just by tapping into data on a regular basis. You can, too.

The unspoken rules of visualisation https://datajournalism.com/read/longreads/the-unspoken-rules-of-visualisation Wed, 29 Jan 2020 21:10:00 +0100 Kaiser Fung https://datajournalism.com/read/longreads/the-unspoken-rules-of-visualisation Visualising data is like solving a jigsaw puzzle. To be successful, there are some things you should know in advance -- the scene to be revealed by the pieces is the story you'd like the data graphic to convey. You also need to understand what you have at your disposal -- the pieces of the jigsaw are the set of data in front of you. And then it’s your job to piece together the puzzle, or assemble the elements of your graphic, to tell your data’s story.

Visualising data is a choice. Instead of words, we elect pictures. As the adage goes: one picture is worth a thousand words. The perceived power in the visual medium derives from its efficiency and multidimensionality.

Consider the following summary of the state of the world's health and wealth, drawn from data assembled by the Gapminder project:

 The last 50 years (1965-2015) have seen tremendous progress 
 in both the health and wealth of nations. The gain in life 
 expectancy has been nothing short of remarkable. In 1965, Iceland 
 topped all nations with the average citizen living to 74 years. 
 By 2015, almost all of Europe, most of the Americas, the 
 majority of Asia, and even a selection of African countries 
 have reached or exceeded that level.

 In the last 50 years, much of the world has become richer, 
 when measured using GDP per capita, PPP inflation-adjusted. 
 In 1965, the majority of nations earned below US$5,000 per 
 head; by 2015, they have lifted incomes above US$10,000. 
 Switzerland with an average income of US$32,000 in 1965 
 remained one of Europe's richest nations, although, by 2015, 
 it's been overtaken by Ireland, Norway and Luxembourg. 
 Many African nations also became wealthier, with the 
 prominent exception of Libya.

Now, let's take a look at a visualisation of that data as a pair of scatter plots: the power of the visual medium is palpable.

Image: A chart showing substantial gains in health and wealth across the globe between 1965 and 2015.

The mind readily discerns the various talking points detailed above. In text, information arrives one nugget at a time, in a prescribed sequence. In pictures, our eyes wander, foraging for information along multiple dimensions at once. Cognition is guided by design elements such as reference lines, legends, data labels, and annotations. Of note, the richness of the visual medium allows complex relationships to surface, which, when expressed verbally, lead to long-winded, caveat-laden sentences.

The efficiency and multidimensionality of the visual medium arise from a set of conventions and rules, which regularise communication between producers of data visualisation and its consumers. These conventions and rules are often unspoken: it's the visual equivalent of ’it goes without saying’.

Imagine if that global health and wealth graphic was also supplied with an additional ‘How to Read this Chart’ box, as seen below:

Stop! You want to scream at me: most of those words aren't necessary. Your objection is sustained. Including the ‘How to Read It’ box belittles the advantages of the visual medium. Lengthy instructions are obviated when designers follow certain conventions and rules that are intuitively grasped by readers. It goes without saying.

In the remainder of this Long Read, we’ll highlight a core set of conventions and rules that should guide our production of data graphics. The references listed at the end provide further explanation of these and other rules. Chapter 1 of information designer Alberto Cairo's book, How Charts Lie, is highly relevant, as he outlines how consumers should read charts from a producer's point of view. Alberto emphasises the possible existence of ‘mental models’ of data visualisation such that visual communications succeed when the designer's model converges with the reader's.

Most conventions and rules in data visualisation are not unique -- in some cases, competing, contradictory conventions co-exist. Rules -- such as how we handle colours -- evolve over time as tools improve. Every convention has its exception: when our design deliberately turns against a rule, we call our readers' attention to the aberration, including, when appropriate, providing a ‘How to Read this Chart’ box.

In Leland Wilkinson's The Grammar of Graphics, he distinguishes aesthetics -- the encoding of data into geometric objects -- from guides, which assist understanding. Following this distinction, I organise the conventions and rules of data visualisation into two groups. In each section, two displays of the same data are juxtaposed, one conforming and one diverging from the spotlighted convention or rule, to reveal the rationale behind these best practices.

Conventions on aesthetics

Pie charts

The pie chart endures despite being the chart form most maligned by data visualisation experts. Some pie charts are serviceable, provided that they follow appropriate conventions.

Let's look at an example lifted from a note by data visualisation developer Xan Gregg. These two pie charts display languages used on the internet:

Example one: a conforming pie chart (left). Example two: a diverging pie chart (right).

Example one tells readers English is used on over half of the internet, while each of six other languages from Russian to Japanese accounts for about five percent.

In constructing this pie chart, I followed a number of conventions: a) use a reasonable number of slices, aggregating minor categories if necessary; b) order the slices by size from the largest to the smallest; c) place the ’Other’ slice at the end of the sequence, regardless of the order scheme; d) position the first and largest slice against the upper vertical radius, and arrange the other slices in a clockwise fashion; e) vary colours only if the colours are encoding data. In this case, I used a lighter shade for the ‘Other’ slice, signalling that it alone consists of multiple languages, and that it is the least important slice on the chart.

Most conventions and rules in data visualisation are not unique -- in some cases, competing, contradictory conventions co-exist.

These rules are unspoken. The designer invokes them silently, and the reader applies them intuitively. When such rules are overlooked, it takes more time to digest the pie chart. Take a look at the diverging example, in which the largest pie slice is placed at a random angle, other slices run in a random order, and each slice is assigned an arbitrary colour. When the chart maker diverges from conventions, the reader must devote time to figure out the logic of the design.
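
Those conventions translate directly into plotting code. A minimal matplotlib sketch -- the language shares are rounded, illustrative figures, pre-sorted from largest to smallest with 'Other' aggregated and placed last:

    # Sketch: a pie chart following the conventions above.
    # Shares are rounded, illustrative figures.
    import matplotlib.pyplot as plt

    labels = ["English", "Russian", "Spanish", "German",
              "French", "Japanese", "Other"]
    shares = [54, 6, 5, 5, 4, 4, 22]
    colours = ["#1f77b4"] * 6 + ["#c6dbef"]  # lighter shade for 'Other'

    plt.pie(shares, labels=labels, colors=colours,
            startangle=90,       # first slice starts at the upper vertical radius
            counterclock=False)  # slices run clockwise
    plt.show()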

Bar charts

The principal convention on a bar chart (and by extension, a column chart) is the start-at-zero rule, which stipulates that the lower limit of the value axis should be set to zero. Our next example, adapted from The Economist, is a specimen that does not follow this convention.

Example one: a diverging bar chart

On this chart, the reader understands the retirement age in Switzerland to be twice that in France, since the Swiss bar is twice the width of the French bar. That reading can't be true, and it isn't true: the Swiss retirement age is only 10% above that of the French. When the value axis is extended to zero, as in our conforming example, the ratio of the bar widths is restored to the ratio of the data.

Example two: a conforming bar chart

The alert reader notices that the designer of example one has planted a break symbol on the left edge of each bar, signalling that its width is truncated (by more than half). Thus, the maker knowingly defies the aesthetic convention on bar charts. Acknowledgement does not fix the distortion introduced by the truncation, leading to probable misinterpretation.

Admittedly, the revamped bar chart is still short of adequate. A more effective display is achieved by switching to a dot plot as shown in example three. Another effective display option is to focus on the gaps in effective versus official retirement ages as shown in example four. Both of these designs work around the start-at-zero rule.

Example three: a dot plot (left). Example four: displaying the gaps (right).
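
In code, the start-at-zero rule is a one-line guard. A matplotlib sketch with illustrative retirement ages (not the figures behind the original chart):

    # Sketch: enforce the start-at-zero rule on a horizontal bar chart.
    # Ages are illustrative values only.
    import matplotlib.pyplot as plt

    countries = ["France", "Germany", "Switzerland"]
    retirement_age = [62, 63, 65]

    fig, ax = plt.subplots()
    ax.barh(countries, retirement_age)
    ax.set_xlim(left=0)  # bar lengths now match the ratios in the data
    plt.show()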

Scatter plots

A scatter plot depicts each unit of data as a dot on a surface spanned by two axes. The horizontal (x) and vertical (y) positions of the dot encode two variables. The shape of the cloud of dots visualises the nature of the correlation between the two variables. Enjoy the splendid scatter plot in our next example, which singles out the United States as an outlier nation, where outsized healthcare spending failed to produce the expected lift in life expectancy.

Example one: a conforming scatter plot

A convention governs which variable to place on which axis. In this example, per capita healthcare spending is purported to be a driver of health outcomes. By convention, healthcare spending (the explanatory variable) is encoded as x, and life expectancy (the outcome) as y.

In comparison, the x- and y-axes are swapped in our diverging example. Its visual form is the reflection of the same data across the 45-degree diagonal.

Example two: a diverging scatter plot

Because the design flouts the convention, many readers, especially those with training in STEM fields, will react with confusion, and even annoyance. While there is no clear design imperative for this rule, a strong scientific justification prevails.

A routine add-on to the scatter plot is the regression line (also misleadingly called a ‘trendline’ by the market-leading spreadsheet program Excel). Regression analysis quantifies the correlation between the two variables displayed by a scatter plot. The regression line is fitted so as to minimise the total squared vertical distance between the line and the cloud of dots. Our next scatter plot includes a regression line.

Example one: a conforming scatter plot with a regression line

Most importantly, the distance between a given dot and the regression line is measured vertically -- not horizontally. This vertical separation is also the cue by which the reader learns the chart's key message: that Americans should have been enjoying a life expectancy of over 82 years, given their level of spending, if additional spending translated to incremental years of life at the same rate as in other countries.

Swapping the x- and y-axes does not reflect the regression line (as it does the dots). For what minimises the vertical distances between dots and the line does not minimise the horizontal distances. As illustrated in example two below, where I reversed the axes of example one above, the regression line of x on y does not coincide with the reflected regression line of y on x.

Example two: a diverging scatter plot with a regression line, and an overlay of the reflected regression line, should the axes be swapped.

Notably, this convention does not dictate which variable should be the explanatory variable, and which the outcome variable. After the designer decides these roles, the convention governs which variable is assigned to which axis. To wit, example two is appropriate if life expectancy is offered as an explanation for the variability in healthcare spending. See my blog post for more on this topic.
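
A regression line of y on x -- the kind that minimises vertical distances -- is what an ordinary least-squares fit produces. A sketch with synthetic data standing in for the spending and life-expectancy figures:

    # Sketch: scatter plot plus a y-on-x regression line (synthetic data).
    # The fit minimises squared *vertical* distances, as discussed above.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    spending = rng.uniform(1000, 9000, size=30)  # explanatory variable -> x
    life_exp = 70 + 0.0015 * spending + rng.normal(0, 1.5, size=30)  # outcome -> y

    slope, intercept = np.polyfit(spending, life_exp, deg=1)

    fig, ax = plt.subplots()
    ax.scatter(spending, life_exp)
    xs = np.linspace(spending.min(), spending.max(), 100)
    ax.plot(xs, slope * xs + intercept)  # the regression line of y on x
    ax.set_xlabel("Healthcare spending per capita")
    ax.set_ylabel("Life expectancy")
    plt.show()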

Time-series plots

The natural place for a time variable, such as years, months, and dates, is on the horizontal axis. By convention, time runs left to right (substitute right to left in right-to-left (RTL) countries). The following pair of charts shows the rapid growth of Chinese tourists visiting Australia. Compare example one, in which time runs left to right, to example two, in which time runs bottom to top. The left-to-right convention is an unspoken rule shared between producers and consumers of data graphics in cultures that read left to right. Veering off this rule always slows down cognition.

Example one: a conforming time-series plot (left). Example two: a diverging time-series plot (right).

Another rule for time-series charts is proportional spacing. When data are collected at uneven intervals, the tick marks on the time axis should mimic the irregularity. Otherwise, the chart distorts the pace of growth. In another diverging example below, the growth trend appears to be linear, rather than ’hockey stick’, an artefact of applying even spacing to unevenly-spaced data.

Example three: another conforming time-series plot (left). Example four: another diverging time-series plot (right).
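
Plotting libraries honour proportional spacing automatically if the time variable is passed as numbers or dates rather than as text labels. A sketch of the difference, with invented figures and deliberately uneven years:

    # Sketch: proportional versus even spacing of uneven time intervals.
    # Visitor counts are invented; the years are deliberately uneven.
    import matplotlib.pyplot as plt

    years = [2000, 2005, 2013, 2015, 2016, 2017, 2018]
    visitors = [120, 250, 700, 920, 1050, 1250, 1400]

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.plot(years, visitors)  # numeric x-values: gaps drawn to scale
    ax1.set_title("Proportional spacing")
    ax2.plot([str(y) for y in years], visitors)  # strings: evenly spaced
    ax2.set_title("Even spacing (distorts the pace)")
    plt.show()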

Colour encoding

In the social media age, colour has become a favourite complement to any data graphic. Here are several main conventions guiding the application of colour:

a) Put a cap on the number of colours. As Dona Wong suggests in The Wall Street Journal Guide to Information Graphics, "admit colors gracefully, as you would receive in-laws into your home."

b) Same colour, same data; colour difference should reflect data difference. This rule disqualifies arbitrary assignment of colours.

c) Use certain colour pairs with care, as they are loaded with meaning. In the business community, black is positive, and red is negative, but in some cultures, black is ominous while red is auspicious. For heatmaps, red is hot, and blue is cold, while in US politics, the red-blue colour pair denotes the two major political parties. As I noted at the start, conventions sometimes clash.

d) Many authors recommend making charts friendly to colour-blind readers, for example, by inspecting a version in grayscale.
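
Rule d can be checked mechanically by converting a palette to perceived brightness: colours with near-identical luminance will blur together in grayscale and for many colour-blind readers. A sketch using matplotlib's colour utilities (the palette is illustrative):

    # Sketch: check whether a palette stays distinguishable in grayscale.
    # Uses standard luminance weights for perceived brightness.
    from matplotlib.colors import to_rgb

    palette = ["#d62728", "#2ca02c", "#1f77b4"]  # red, green, blue

    for colour in palette:
        r, g, b = to_rgb(colour)
        luminance = 0.299 * r + 0.587 * g + 0.114 * b
        print(f"{colour}: luminance {luminance:.2f}")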

Let’s look at two variations of our bar chart that shows retirement ages in 13 countries. On the left, the bars are assigned meaningless colours, diverging from rule b. The reader gets confused by the false signal, searching fruitlessly for the data behind the colour scheme. On the right, the design uses a unique colour for every bar, defying rule a. Here, the colour palette becomes a distractor, diminishing one's speed of understanding.

Example one: diverging colour conventions

The two plots of example two derive from the bar chart that plots the gaps in retirement ages. By the designer's choice, a positive gap means the effective retirement age exceeds the official retirement age. The chart on the left applies green to positive gaps, and red to negative gaps. Rule d advises against pairing red and green hues, because a red-green colour-blind reader cannot distinguish between them. The chart on the right encodes positive gaps in red, and negative ones in black. This choice of colours is confusing because of the convention, particularly popular in business, of using red ink for negative numbers.

Example two: more diverging colour conventions

Conventions on guides

Chart designers add guides such as legends, axes, gridlines, and labels, with the express purpose of accelerating cognition. As Edward Tufte and other experts have pointed out, such guides sometimes backfire when poorly executed. In response, a large set of conventions and rules has been developed.

Axes

It goes without saying that axes have canonical directions. On the vertical axis, larger values are placed above smaller values, while on the horizontal, larger values are placed on the right of smaller values (except in RTL countries). Flouting these rules results in nonsensical charts.

In 2014, Reuters published the following line chart that promptly unleashed a tweet storm in the data visualisation community.

Image: A chart showing the effect of Florida's Stand Your Ground law on gun deaths.

The Stand Your Ground law, which legalises using deadly force for self-protection, was widely expected to worsen gun violence, and yet this chart depicted a downward trend upon its enactment in 2005. Upon discovering the inversion of the vertical axis, readers realised that their intuition of ’lower is less’ had been misplaced. Reactions were scathing. A college professor complained: "It is so deeply misleading that I loathe to expose your eyeballs to it”. This tweet storm shows why designers should follow the conventions unless there is a compelling reason not to.

If a time dimension is involved, the convention is to place time on the horizontal axis. In a scatter plot, the outcome variable should be coded to the vertical axis, and the explanatory variable to the horizontal axis.

Two other unspoken rules -- on limits and tick marks -- inform the design of axes. Reasonable limits are chosen to remove excessive white space from the plotting surface. Tick marks should fall on easily interpretable increments and values; for example, the sequence [0, 20, 40, 60, ..., 120] instead of [2, 22, 42, 62, ..., 122], or worse, [2.3, 22.3, 42.3, 62.3, ..., 122.3].

The following pair of charts is identical, except for the axis labels. They both convey the message that Chinese tourists entering Australia have outnumbered those from New Zealand since 2017. The more precise labels in example two are harder to grasp.

Example one: conforming axes (left). Example two: diverging axes (right).

Legends

Almost all charts include a legend. A colour legend is commonly found on line charts, bar charts, pie charts, bubble charts, and more. The first rule for legends is to not use a legend if direct labels are feasible.

On a line chart with a bundle of lines, it is usually preferable to place labels next to the lines, rather than inside a legend box. It goes without saying that the colours in the legend must correspond one-to-one to the colours on the chart itself, and that the order of appearance should mimic that on the chart. Popular software such as Excel frequently makes a mess of this rule, listing items in the legend box in the reverse of the order in which they appear on the main chart.

In these graphs, featuring the eye-popping rise in Chinese tourists visiting Australia, the line labels follow the rank of the tourist counts in 2018, the most recent year with data. Example one, with direct labelling, reduces the back-and-forth needed to connect each label with its line.

Example one: conforming legends (left). Example two: diverging legends (right).
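
Direct labelling takes only a few lines: place text at each line's final data point instead of calling a legend. A matplotlib sketch with invented visitor counts:

    # Sketch: label lines directly at their endpoints, in lieu of a legend.
    # Visitor counts (in thousands) are invented illustrative figures.
    import matplotlib.pyplot as plt

    years = list(range(2010, 2019))
    series = {
        "China":       [450, 540, 630, 710, 840, 1000, 1200, 1350, 1430],
        "New Zealand": [1100, 1120, 1150, 1180, 1220, 1260, 1300, 1330, 1360],
    }

    fig, ax = plt.subplots()
    for country, counts in series.items():
        ax.plot(years, counts)
        # Put the label just right of the last point.
        ax.text(years[-1] + 0.2, counts[-1], country, va="center")
    ax.set_xlim(right=2020)  # leave room for the labels
    plt.show()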

For bar or column charts, if a legend box cannot be avoided, the convention is to place it above the chart, below the titles, as readers rely on the information to interpret the graphic. The order of categories should mirror the orientation of the data.

The National Post portrayed results from a survey of attitudes toward immigrants in a series of paired column charts, one of which is reproduced in example three. These charts adopt the conventions of ordering the countries according to the proportion of respondents who agreed with each statement, with the colour legend placed on top, below the chart title. By contrast, example four uses an alphabetical ordering of countries, and a right-sided legend, which significantly complicates cognition.

Example three: conforming legends

Example four: diverging legends

An emerging convention is to embed the legend into chart titles or subtitles. Applied to the Australian tourism data graphic, the coloured text in example five points the reader's eyes to the key countries of China and New Zealand. Example six requires a tad more effort from the reader to link up the chart title and the line labels.

Example five: conforming legends (left). Example six: diverging legends (right).

Order

How items are ordered on a chart has an outsized effect on the reader's comprehension. Despite software's predilection for the alphabetical scheme, it is rarely the right choice. Howard Wainer, the author of Visual Revelations and other books on data visualisation, derided this as "Alabama first!" (Alabama is the first state in alphabetical order.) Convention calls for using the natural order when it is available. Time variables, age groups, income groups, education levels, and so on all have natural orders.

Example one shows the relative popularity of crime movies across age groups in the United Kingdom, as compiled by researcher Stephen Follows. The age groups are presented in natural order, from the youngest to the oldest. Ordering by value, as seen in example two, does not work well with data that have a natural order, as the eyes jump around to re-establish the sequence.

Example one: conforming order (left). Example two: diverging order (right).

When making a panel of plots, the rule is to retain the same order of values across all charts. The pair of plots in example three, adapted from the previously-cited study of attitudes towards immigration by The National Post, illustrates why switching the order of countries from chart to chart hinders the reader's ability to compare responses to the two survey questions.

Example three: diverging order

We should lock the order of the countries throughout the panel, as shown in example four. Countries are laid out from left to right by the decreasing proportion of respondents who agreed with the first statement.

Example four: conforming order

Annotation

Text used sparingly complements the visual experience. Many authors recommend using informative chart titles. The designer must replace the default chart titles assigned by graphing software, typically formed from concatenating the axis titles. Another rule is to explain all acronyms and jargon. It is also conventional to include the source(s) of data in a footer.

When labelling data, the rule is to label items that are key parts of the story. Don't label everything. The labels, in effect, provide cues to readers as to the most significant items. Example one reproduces an earlier chart examining the effectiveness of healthcare spending, with the full set of country labels. Too many labels contend for the reader's attention.

Example one: diverging labels

When to ignore conventions

A convention arises when a plurality of practitioners agree on the wisdom of an element of design. Some rules have cognitive rationales supported by scientific experimentation, which appear to be neither sufficient nor necessary for their popularisation. Researchers Bill Cleveland and Robert Kosara have conducted some of these investigations. But almost every convention has exceptions. My advice is: think twice before you break a rule but don't think twice if you must.

I have come across many examples of charts in which one or another convention is justifiably discarded to improve understanding. Let me end this Long Read with an example in which rule-breaking pays dividends.

Imagine using example one to convey a message to American readers that the US dollar has been strengthening against the Euro since 2018. The visual impression of a trend line running down conflicts with the strengthening message. Because the exchange rate is expressed as the number of US dollars per one Euro, the lower this number, the stronger the US dollar. One band-aid to this visual challenge is to place annotations, as in example two.

Example one: a chart without annotation (left). Example two: a chart with annotation (right).

Such annotation merely sets the goalpost for a puzzle that the reader must resolve. Why does a lower line represent a stronger US dollar?

In this situation, the designer may as well break the axis rule by inverting the vertical axis. Example three is identical to our second example, except for the axis inversion (and the consequent flipping of the labels).

Example three: a diverging chart with an inverted axis.

Another way to achieve this effect is to invert the exchange rate ratio, expressing it as the number of Euros per one US dollar. This solution won't please the financial community who are accustomed to looking at the US-dollar-to-Euro ratio.
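
In most plotting libraries the inversion is a single call. A matplotlib sketch with invented exchange rates:

    # Sketch: invert the vertical axis so 'up' reads as a stronger dollar.
    # Exchange rates are invented illustrative values.
    import matplotlib.pyplot as plt

    years = [2018, 2019, 2020]
    usd_per_euro = [1.23, 1.12, 1.09]  # lower value = stronger dollar

    fig, ax = plt.subplots()
    ax.plot(years, usd_per_euro)
    ax.invert_yaxis()  # the line now rises as the dollar strengthens
    ax.set_ylabel("US dollars per euro (inverted scale)")
    plt.show()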

Conclusion

The visual medium excels at conveying a large amount of information in multiple dimensions efficiently. Such efficiency relies on a set of unspoken rules and conventions, shared implicitly between producers and consumers of data graphics. In this Long Read, we’ve reviewed a selection of major conventions covering both aesthetics and guides of charts. Designers of data visualisation can exploit these conventions to simplify their graphics, removing unnecessary explanations. Recognising these unspoken rules helps avoid unintended misunderstanding. As with all visual design, depending on your specific application and audience, it may occasionally be prudent to defy convention. Lastly, think twice before you break a rule, but don't think twice if you must.

Recap of conventions

Conventions on aesthetics

Pie charts

  1. Use a reasonable number of slices
  2. Aggregate minor categories into one ’Other’ slice
  3. Order slices by size from largest to smallest
  4. Place the ’Other’ slice at the end of the sequence, regardless of the order
  5. Position the first and largest slice against the upper vertical radius
  6. Arrange slices in a clockwise fashion
  7. Vary colours only if the colours are encoding data

Bar charts

  1. Start value axis at zero

Scatter plots

  1. Place explanatory variable on horizontal axis
  2. Place outcome variable on vertical axis
  3. If adding a regression line, assign an outcome variable

Time-series plots

  1. Plot time on the horizontal axis
  2. Time runs left to right
  3. If time intervals are uneven, tick marks should be uneven in the same way

Colour encoding

  1. Limit the total number of colours
  2. Colour difference should reflect data difference
  3. Use certain colour pairs with care
  4. Make charts friendly to colour-blind readers

Conventions on guides

Axes

  1. Use canonical directions (i.e. larger values to the right of smaller values)
  2. Time goes on the horizontal axis
  3. Place an outcome variable on the vertical axis
  4. Choose limits to remove excessive white space
  5. Tick marks should fall on easily interpretable increments and values

Legends

  1. Use direct labels if feasible
  2. Colours in the legend should correspond one-to-one to the colours on the chart
  3. Colours in the legend should be presented in the same order as they appear on the chart
  4. Place legend on top, below the title
  5. Embed legend into chart titles or subtitles

Order

  1. Place values in the natural order when it is available
  2. Avoid the default alphabetical order unless it is justified by the context
  3. Retain the same order across all plots in a panel of charts

Annotation

  1. Use informative chart titles
  2. Explain all acronyms and jargon
  3. Include the source of data in a footer
  4. Label only key items, not all items

The following books, all cited in this Long Read, are great resources on the conventions and rules of data visualisation:

  1. Alberto Cairo, How Charts Lie
  2. Leland Wilkinson, The Grammar of Graphics
  3. Dona Wong, The Wall Street Journal Guide to Information Graphics
  4. Howard Wainer, Visual Revelations

Journalism first: doing advocacy with data on your side https://datajournalism.com/read/longreads/doing-advocacy-with-data Fri, 17 Jan 2020 15:59:00 +0100 Eva Belmonte https://datajournalism.com/read/longreads/doing-advocacy-with-data Some of the best examples of data journalism are big investigations, where you spend months understanding a complex issue, discovering data where there was none or diving into huge amounts of information to find something invisible to the naked eye. Bringing new light to an issue. But to understand the complexity of this data you need time, resources, and a lot of digging.

By the end of that odyssey, you no longer consult experts, you are the expert. And it is a waste to not take advantage of all that knowledge to try to fix the problems you encountered along the way.

Now, you might say: “We are journalists. We are here to narrate the world, not to fix it.”

I'm not so sure.

What I am sure of is that sometimes publishing a story is not enough. And that the line separating journalism from advocacy -- which has always been there, even if it has been hidden -- is thinner than it may seem to many, in both traditional journalism and in the most cutting-edge data journalism teams.

Miriam Wells, impact editor at The Bureau Of Investigative Journalism, said in an interview with NiemanLab that she felt “a bit frustrated” with traditional journalism: “No matter what you write, no matter how much of a splash it makes, it doesn't always make a change.”

Miriam’s role, as she explains, is, among other things, to bridge the gap between journalists and activists. And this relationship between the two camps raises many questions: How do you treat those same activists when they are also sources? How does it affect your editorial independence, or modify which topics you investigate and which you do not?

But what if we go one step further and the journalist becomes an actual activist? Those questions become even trickier. Yet it is very worthwhile to try to solve them. In Miriam’s words: "I became a journalist because, even though it's cliché, I wanted to make a difference."

You have the data on your side: take advantage

At Civio, a small non-profit newsroom in Spain, we have been combining investigative reporting and data journalism with activism for years. We have asked ourselves these questions millions of times and we are very aware of the need to build a Chinese wall between investigations and activism, and between the goals of each, which can be quite different. But yes, we lobby. And it all started because it was impossible for us to stop thinking about the problems we had found just because we had published an article, when we saw the solution so clearly. How can you not do anything when you have the data on your side?

Although the nomenclature varies and examples are scarce, we are not the only ones that combine activism and journalism. ProPublica does it, too. In a white paper, Issues Around Impact, ProPublica president Richard J. Tofel links the relationship to solutions journalism and explains the reasons why ProPublica sometimes goes one step further: “When a problem is identified by reporting, and when a solution is revealed as well — e.g., nurses with criminal records are not having their nursing licenses revoked but could be, or presidential pardons are being issued and withheld on a racially discriminatory basis due to Justice Department internal guidelines that could be changed at the stroke of a pen — it is appropriate for journalists to call attention to the problem and the remedy until the remedy is put in place.”

No matter what you write, no matter how much of a splash it makes, it doesn't always make a change.

One of those examples arises from the long-running, in-depth investigation ProPublica has been conducting for years on presidential pardons. In this case, what the data have shown, and what ProPublica is fighting, is clear racial discrimination in decisions to grant -- or not -- presidential pardons.

In Spain, the essential question about pardons is different, because the context is different. Here the fact is that it is much easier to receive a pardon if you have been convicted of a corruption-related crime than of a common crime, such as theft or a small-scale drug offence. Some 227 people have been pardoned for corruption in the last 23 years. And religious orders have a preferential path for pardons. All these headlines emerged from The Pardonometer, an investigation by Civio that launched in 2013 and that, in addition to many stories, created the first database of pardons in Spain, now used by other journalists and activists.

With that experience behind us, we went to Congress to share our point of view, based on our data, on how to reform the century-and-a-half-old Law of the Pardon that parliamentary groups were negotiating. We asked for two things: firstly, that pardons should no longer be at the Government's discretion and, secondly, that pardons should be subject to some oversight -- whether by the sentencing judge or by parliament. We knew that the pardons process was being abused to forgive corrupt people, sometimes members of the ruling party itself, or public officials. With this in mind, we also asked that the whole process be made more transparent.

This is where a certain professional selfishness comes in: During our investigation, we found a tremendous lack of information around pardons. There was no data on who requested the pardon, and the reasons for pardoning one person or another were not published. We asked that all of this information be made public. For us. For all journalists who might investigate this issue. And, finally, for all interested citizens.

Choose your battles

But what battles should a journalist fight? Is every cause worth it? ProPublica’s Richard Tofel wrote: "When something is literally indefensible, and when the means of remedy are clear and certain, journalists should not hesitate to suggest how change could occur."

Are we sure that "literally indefensible" means exactly the same thing for everyone? Of course not. We can talk about human rights and find a general consensus -- or not, depending on the times. Or, we look at issues of common concern, like political corruption. Any fight against corruption, in countries where it is prevalent, such as Spain, may seem fair in everyone's eyes. It is one of the areas in which Civio advocates, although not the main one. Even so, there will always be someone who differs on the details or thinks that journalists should not get involved.

What I am sure a journalist can defend, and what they should in fact defend, because it is our profession, is freedom of information. There is no freedom of information without transparency and a right of access to information. That is where our activism becomes selfish -- in the best sense of the word -- because we are fighting a battle to defend our own field.

Civio's activism, for example, is very limited. That is because it requires a lot of resources and time. The path we have been pushing for years does not go through publishing a statement and waiting to be ignored until media pressure builds up, something that would be the activist equivalent of publishing an article and waiting for the problem to solve itself. Instead, it requires us to study laws, draft amendments and propose concrete improvements based on the data extracted from our investigations. And, since it is limited, it focuses on the core of what we do.

Without good laws, there is no data. Without data, there is no story.

Virtually all of our battles focus on demanding more transparency and access to information. If it is important in traditional journalism, then it is even more important when we talk about data journalism. The laws of access to information are one of the most powerful tools that data journalists have to get stories. Without good laws, there is no data. Without data, there is no story.

Demanding better transparency laws, litigating in court against the concealment of information, and demanding that key data be public...All are struggles for our raw material: nothing more, nothing less. That's where journalists can feel the most comfortable, where activism makes sense, even though it also benefits everyone else.

Journalists may press for a solution to a general problem that affects all citizens, such as the pardoning of corrupt criminals, and demand more control and transparency in the process. That might reveal unexpected data to investigate and benefit their journalism.

Or, the other way around, a media outlet may press for publication of the prices that governments pay for medicines in order to write an article about new and expensive drugs and their impact on the health system. Then, perhaps, the obligation to publish that information puts pressure on pharmaceutical companies, who might offer bigger discounts, or patient groups may demand that new drugs be added to the public health system, benefiting all citizens.

In neither case have we lost sight of journalism: it is always present as a public good worth protecting.

There is still a third case: when fighting, after an investigation, to eliminate the obstacles that we encountered along the way and made it difficult, if not impossible, to obtain the data we needed to tell our story. It’s a sort of final revenge.

But first, journalism

In 2012, Civio began to investigate public contracts through Our Daily Official Gazette, a project that, every day, reads, analyses, and contextualises everything published in the Spanish BOE, the Official State Gazette. From individual cases -- corruption, inflated contracts, contracts awarded to people linked to public offices, and so on -- we turned to a more global analysis suited to data journalism with Who’s paid for the work?. In it, we dug into thousands of contracts published in the BOE since 2009 to report that, among other things, just 10 construction companies raked in 7 of every 10 euros allocated through the Official State Gazette in contracts for public works. This project took months of work. A very important part went to obtaining an in-depth understanding of the complex legislation on public procurement. Another, even more important part, was cleaning up horrific data that had never been prepared for machine analysis, had millions of errors and -- this is key -- was missing a lot of information.

One of the main barriers we found was that we could not find something as simple as how much money each company took in each Temporary Joint Venture (TJV) and their percentage of participation. The state did not publish that data. Nor did it publish the identity of all bidders, so we could not investigate the distribution of contracts or cartels.

When Congress debated a reform of the Public Procurement Law we argued that this data should be public. Now, thanks to that pressure, consisting of dozens of pages of proposed amendments that we sent to the political parties and some of which ended up being included in the new law, the administration must publish those data with each contract. The key is that the advocacy came after the investigation, not before.

Be independent and prove it

According to Richard Tofel, ProPublica journalists cannot lobby. The concept is not the same in the United States and in Europe. In this white paper on impact, he rules out participation in, or the organisation of, demonstrations, or arguing for partisan proposals. But he does defend keeping the pressure on until solutions are found to the problems uncovered in investigations. For us, that is lobbying. But we are not lobbyists for hire. That is why we must not only lobby without compromising our independence, which is fundamental to journalism, but also be more transparent than anyone.

To prevent anyone from linking our independent data journalism to a political party, we must treat all political parties in the same way. At Civio, we follow several rules to ensure this. First: we speak with all of the parties represented in parliament. If any one of them asks us for an assessment on a topic and we believe it is relevant, we publish our assessment and then send it to everyone, not just them. Again, the timing matters. First, we publish a freely accessible list of proposals for amendments to a law, for example, and then we share it with all political parties. It is important that the documents we promote are always known to everyone, without cheating or subterfuge. An open lobby.

And a transparent lobby. In the same way that we demand the Government and public representatives make their meeting agendas transparent, we publish all the meetings we have with public representatives and political parties, including the participants, the reasons for the meeting, and the documents exchanged. We never, ever, attend meetings without a clear agenda.

You can be an activist after publishing, but not during an investigation and you may not treat activists differently from other sources, despite the temptation.

The issue of timing is much more important than it seems. Because activism cannot be what starts or moves the wheels of an investigation. It cannot set the agenda. Journalists cannot go into an investigation with preconceived ideas. They must approach the data with an open mind and without prejudice. Therefore, activism in journalistic organisations such as Civio only makes sense after publishing. Along the same lines, Richard Tofel argues: “Journalism begins with questions and progresses, as facts are determined, to answers. Advocacy begins with answers, with the facts already assumed to be established. In short, advocates know before they begin work the sort of impact they are seeking, while journalists only learn in the course of their work what the problem is, and only after this can they begin to understand the kind of impact their work might have.”

And this is not only key when choosing which topics are investigated and then confronting them without prejudice, but also when dealing with sources. You can be an activist after publishing, but not during an investigation and you may not treat activists differently from other sources, despite the temptation. For example, after publishing Medicamentalia, our investigation on access to health, we advocated for transparency in drug prices, in government negotiations with industry, and in the relationship between health and pharmaceutical professionals.

Civil society organisations dedicated to the fight for access to health have not only used our data for their campaigns; at times we have also applied pressure together when our interests coincided. That can be a problem, because a journalist should not get close to their sources or develop sympathy for them. It can affect your independence. And it is tempting: NGO members are much more friendly and open to this type of investigation than the communications staff of large pharmaceutical companies.

Therefore, the key is time frames, again, and let's not forget that sources are always sources: with their own interests -- however noble they may seem -- and their own agenda. And we must treat everyone equally. This rule is especially important when we talk about data journalism, and about fact-based stories, because 1) you have to distrust data that comes from all sides, always; and 2) the statements or stories told by one or the other side cannot be the basis of your stories.

Not only words or data: a complete case

Sometimes, words and data alone are not enough to tell a story. Sometimes a story leads you, unintentionally, to try to find the remedy to an unfair situation by whatever means necessary, even if they go beyond traditional journalism. That was the case with our electricity subsidy story.

In 2017, Civio began reporting on new legislation that would modify the access criteria for the so-called Social Electricity Bonus, a discount on energy bills for at-risk individuals and families. We tried to explain the conditions for eligibility, and we quickly realised they were very difficult to understand. So we took another step, beyond words, and created an application that, with a few basic inputs, tells readers whether or not they have the right to the subsidy, in addition to guiding them through the complicated process of requesting it.

Thousands of people used the application and hundreds wrote or called us with questions that neither the electricity companies, which acted as intermediaries, nor the Government had answered. During this process we published several articles reporting that almost two million people who could benefit from the aid had not claimed it due to lack of information or the complexity of the system. We also reported that the internal application used by the Government to distribute the subsidy was denying it to people who qualified for it.

Here comes the advocacy: to solve the first problem, we collaborated with the administration by proposing ways to improve the subsidy to make it easier for citizens, which will be included in the national strategy against energy poverty. To solve the second problem, we asked for the administration’s source code but, although the Transparency Council ruled in our favour, our request was denied.

That is why, at the beginning of 2019, we appealed that decision in court.

Nothing new under the sun

Richard Tofel says that, "Squeamishness about staying with such a story until reform is undertaken has been a weakness of the traditional press in recent decades, not a sign of virtuous neutrality."

But the truth is that the traditional press has always pressed for changes. There are, for example, editorials asking readers to vote for a particular party in elections, or those that demand certain legislative reforms from the Government.

Editorials and front pages have been used too often as tools to change laws or move governments. Or, for much more mundane reasons, publishers’ associations have lobbied directly for tax breaks for themselves or for more institutional advertising.

That is the fourth estate. But non-profit data and investigative journalism media have different motives. It is not about partisan struggle, nor about power. They are not peddling opinions. In fact, there is no room for opinion in their pages.

Instead, they have the data on their side.

Civio's rules for advocacy

Maintaining independent journalism with its own agenda while pressing for changes is not easy. Some key rules to remember:

  1. The investigation must not stem from, nor be motivated by, a desire for change rather than journalistic interest.
  2. Unlike activism, journalism must be guided by an open mind, without prejudice, especially when analysing data.
  3. Advocacy only after publication, never before.
  4. Only advocate on topics in which we are experts after our investigations. With the data in hand.
  5. Limited advocacy, focussing on issues related to our journalistic mission: access to information, transparency.
  6. Transparent and without partisanship, treating all parties equally and making the entire process open.
  7. All sources are equal: we must distrust them all, activists included.
]]>
Tackling math anxiety in journalism students https://datajournalism.com/read/longreads/math-anxiety-in-journalism Wed, 11 Dec 2019 10:28:00 +0100 Kayt Davies https://datajournalism.com/read/longreads/math-anxiety-in-journalism A group of journalism educators from around the world, all passionate about data journalism, gathered in a sunny classroom of Paris’ Dauphine University, keen to find a way forward.

Less than half an hour in, a clear divide, verging on tense, had emerged between two schools of thought. The first was the argument that we need to start teaching students about coding, and that the failure to do so is irresponsible. The dismayed other side lamented that their students lacked the ability, confidence, and/or willingness to engage with numbers at all, and that coding was a bridge too far.

The gathering was the World Journalism Education Congress’s Data Journalism Syndicate in July 2019 and the chair, Professor Norman Lewis from the University of Florida, brokered peace by calmly noting what was happening and identifying it as a schism present in the current landscape. The whole room agreed that we need to do more, and we need to do it better, but getting some students to do any number-based reporting, or to approach it at all, is a massive challenge.

Over two sessions of conversation the group debated the question:

What essential computational skills must emerging journalists learn to successfully work with data, and what approach should we take toward teaching them?

The group’s answer -- by consensus -- was a recommendation to focus broadly on ‘data literacy’ rather than on using any one specific programme or on writing code. We workshopped the term ‘data literacy’, and decided it included basic maths and understanding numbers, as well as how research is conducted, the limits of statistics, and common errors in interpretation. We concluded that it was important to: “Teach a foundational understanding of numeracy and quantitative data, sufficient to confidently interpret numbers and avoid errors so that math-averse students can confront numbers with courage”.

Confronting numbers with courage

Courage is an important word in this context, because for many students fear is a limiting factor, not just lack of ability. But this is not a new notion. Way back in 1972, Frank Richardson and Richard Suinn developed a Mathematics Anxiety Rating Scale (MARS) for use in the diagnosis and possible treatment of the problem. Their research rested on earlier studies from the 1950s and 60s, which found that “different kinds of anxiety lead to different effects on intellectual performance”.

Cleveland State University psychologist Mark Ashcraft is one of many who have continued to explore the experience of math anxiety. In 2002, he conducted a study using a shortened version of the MARS. He defined math anxiety as “a feeling of tension, apprehension, or fear that interferes with math performance” and wrote: “Highly math-anxious people also espouse negative attitudes toward math and hold negative self-perceptions about their math abilities…It is, therefore, no surprise that people with math anxiety tend to avoid college majors and career paths that depend heavily on math or quantitative skills, with obvious and unfortunate consequences.”

More recently, researchers investigating the pedagogy of data journalism have noted that many journalism students clearly self-identify as math averse. Amy Schmitz Weiss and Jessica Retis-Rivas from San Diego even called one of their articles ‘I don’t like maths, that’s why I’m in journalism’ because it was a refrain they heard so often.

Looking at this body of work and the lived experience in the roomful of educators in Paris, it seems safe to say that math anxiety is a problem in contemporary journalism education, and it needs to be addressed. The next question is how.

Silo-shifting into education and psychology literature reveals that there is already a healthy body of research on this topic, as educators across fields ranging from teaching to politics, philosophy, and advertising have tackled the problem of math-phobic students and written up their results.

In this Long Read, we’ll highlight three of the key ideas they have explored and how their insights can be incorporated into journalism classes.

1. Talk to your students about math anxiety

Insights from research

A study led by Allison McCulloch in 2013 looked at the problem with a group of trainee elementary school teachers. Math anxiety is a known problem in this setting. The researchers asked each of the trainees to write mini-autobiographies describing the origins of their self-perceptions about their math-ability. These autobiographies provided valuable insights into the causes and mutability of math anxiety. The researchers also found that positive transitions in the participants’ stories “were always related to a particular teacher who made them feel comfortable, cared about, and believed in”. And, importantly, their participants reported that the process of documenting their math-perception-formation-process contributed to reducing their math anxiety.

Earlier, in 1998, Norma Harper and CJ Daane put trainee teachers through a math-anxiety reduction course and reported on its success. They observed that interventions need to help students reflect on their own past math experiences and anxiety levels to enable them to perform better as teachers.

Likewise, Anne Wescott Dodd recommends giving students a questionnaire on the first day of class asking “how do you feel about mathematics?” and “how did you do in mathematics last year?” to identify who will need the most help. One of the key aspects of math anxiety that she identifies is the loneliness that arises when students develop a belief that everyone else in the class understands what is being explained -- “they’ll suffer in silence rather than risk looking stupid by asking a question”. She recommends collaborative and cooperative activities to mitigate this problem.

Experiences from a journalism classroom

In three successive iterations of an undergraduate journalism unit that explicitly includes data journalism, I have devoted the first class to talking about math anxiety -- in three stages.

In the first, I tell my students that research has identified it as a global problem across many disciplines, across many decades. I tell them that it is a learnt self-perception, that it can be changed, and that overcoming their anxiety will make them better at maths.

In the second stage, I invite them to consider whether they are math anxious (and to what extent) and where they may have gotten those perceptions, and I allow them time to tell their own stories in pairs or small groups.

In the third stage, I tackle the loneliness/isolation issue raised by Anne Wescott Dodd by identifying myself as a ‘recovering math-phobic’. I invite them to join me and the other math-anxious class members in a collective journey of ‘feeling the fear and doing it anyway’. We then set ground rules for the class that include giving permission to ask any question about maths -- ‘no question is too basic’ and ‘all questions help make me a better teacher’. I reassure them that non-judgemental support will be provided for any math-related knowledge gaps and that making progress from where they are now is the aim of the class, not getting a specific answer correct under stressful circumstances at the end.

2. Be supportive, fun, and funny

Insights from research

In their 1998 study, Norma Harper and CJ Daane also cited studies dating back to the 1980s that advocated the prime importance of providing students with a supportive environment.

This was reiterated by Tina Rye Sloan in 2010, when she interviewed 72 preservice teachers about their own math anxiety. In her findings, she concluded that it was important for educators to “create a supportive atmosphere with mutual respect and acceptance … [and] an emotional climate that is inviting and reassuring”.

In a 1990 article called ‘What’s funny about statistics? A technique for reducing student anxiety’, Steven Schacht and Brad Stewart describe the usefulness of humour in tackling math anxiety and conclude that if it’s done right, it works a treat. They warned against using comedic content that featured aggression, sexual content, or that mocked college students, because jokes that miss the mark can increase anxiety. Instead, they used cartoons to illustrate and frame problems in a mathematics course for social science students. Their participants reported that the cartoons lightened the mood, increased the fun, and reduced their math anxiety.

And, in her 1992 article, Anne Wescott Dodd noted that: “Changing negative beliefs is a slow process. Success is more likely to occur first on a small task than on a large one, such as a unit test. Wise use of games, group activities and carefully chosen assignments may be needed to overcome firmly entrenched beliefs.”

Experiences from a journalism classroom

Structuring a unit so that some marks are allocated for the completion of collaborative in-class tasks can encourage engagement with numbers. Another approach is to allocate marks to a journal of reflections on what students learnt from in-class activities. These approaches make attending class attractive as a venue for socialising and allow space for risk-taking and creativity.

Peer-to-peer learning can be further encouraged by stressing that those in the class with more math ability and confidence can benefit from helping other students, as teaching deepens understanding, and the ability to explain things is a key media skill.

In one tutorial, for example, I hand out a ten-question quiz of math problems, all phrased in journalism terms:

  • ‘For a story on housing prices, you need to calculate…’
  • ‘For a story about a music festival, you need to compare attendance figures’

Rather than having students work alone, they work in pairs and need to agree on all of their answers. Then, I get them to pair up again and compare answers, explaining their working for anything that they differ on. I walk around the room and quickly mark each group’s answers -- students need to continue working on the problems that aren’t right, but I let groups that have the correct answers help them. When the whole class has all of the answers the task is complete. To use gamer-speak: ’achievement unlocked’.

Incorporating cartoons, memes, and photos of math-anxiety related merchandise (yep, google for it) into PowerPoints can lighten the mood and underline how universal it is, reinforcing the commitment to overcome it.

3. Plan a learning trajectory

Insights from research

In a 2014 paper, Johan Adriaensen, Evelyn Coremans, and Bart Kerremans developed and implemented a ‘Learning Trajectory of Quantitative Methods’ as a progressive approach to teaching research methods to sociology and political science students, with the aim of reducing math anxiety.

Their trajectory breaks the learning process down into four steps, derived from the stages in conducting quantitative research. These are their four steps:

  1. From concept to variable: In the first stage, students learn to transform an abstract concept into a measurable indicator. The emphasis is on the operationalisation stage of the research process whereby students are made aware that for each concept, multiple indicators are possible, each with their own strengths and weaknesses.
  2. From variable to data: Once a concept has been operationalised, students learn to look for the appropriate data. Given limitations on data availability, previous choices might need to be re-evaluated.
  3. From data to descriptive statistics: A first step in analysing data consists of descriptive statistics. Students learn to interpret graphs and tables, select the most appropriate (visual) representation and draw meaningful conclusions from their data.
  4. From descriptive to analytic statistics: In the last step, students learn to execute and interpret analytic (inferential) statistics, including evaluating and scrutinising research articles.

Experiences from a journalism classroom

Consideration of these four steps is useful because it helps ensure that the content added to units across a major or degree course builds on earlier teaching, rather than repeating it. In addition, categorising data journalism activities into these four groups can illuminate gaps in a programme of learning.

I have also found that context is key for media students, as they are quick to ask “why is this relevant?” and “what has this got to do with journalism?”. Here are a few teaching ideas to answer these questions, categorised by trajectory stage.

1. Concept to variable

Ask small groups to think of an issue they could (as a team of journalists) delve into and to develop a pitch for the teacher/news editor/class. Tell them their format will have to include a series of infographics. Hand out coloured pens and a blank local map, a world map, set of human figures, and a timeline, and ask them to think about how these could be filled with data to do part of the storytelling. Could they show their answers to the questions of who, when, where, why, what and/or how? Be clear that they are not doing the research at this stage -- they are just making a research plan, so they know what they will go looking for and can ask the hypothetical boss for permission to spend time doing it.

Students enjoy this task because they get to choose a topic, work together, and colour-in. Their topics typically start out broad (beer in this city is too expensive; too many forests are being cut down; sexual assault is not acceptable) and confronting them with maps and graphics makes them consider what figures are relevant. For example, for a beer story would they use tavern prices, bottle-shop prices, all volumes, all brands, or just one type? And would they then compare prices in all cities, or cities versus regional areas? Or suburbs with different average incomes? This is exactly the work of ‘operationalising’ a research project.

Materials provided to students during this activity.

Here’s another activity to prompt critical thinking about variables:

Ask everyone in the class to write down, secretly and immediately, their favourite flavour of ice-cream. Use the whiteboard to do a quick round of the class with each person singing out their flavour -- give extra ticks to ones that get a second or third vote -- and voila, you’ll have a huge list of different flavours (almost as many as there are students) with something like salted caramel or cookies-n-cream garnering a few extra ticks and winning the popularity contest.

Then tell your students that you are hypothetically about to order a big tub of ice-cream for the class to share and you need to know whether they want vanilla, strawberry, or chocolate. Do a quick tally on the board of votes for each of these three.

Now, ask them which vote correctly shows the class’ preferred ice-cream flavour? Is it salted caramel or chocolate? When they say salted caramel, challenge them by saying “but only three of you chose that, 15 chose chocolate”.

Ask them how they would have to report these numbers in order to be accurate. Workshop ways of wording it.

You can talk about how this applies to other research that they may find themselves reporting. You can discuss how limiting the options can change the answers. Should researchers always include an odd number of options in Likert scales to allow for a neutral centre? Does the failure to do so force participants to give dishonest answers? You can also talk about political push polling.

Next, ask them if this data could be reported as the ice-cream preference of only this class, or all journalism students at the university? How about all communications students in the city? How about all young people in the whole country? At what point does the sample size become too small to be reliable? How would you have to report that in order to be accurate?

You can follow this up further by looking at some studies that disclose their sample sizes or sampling techniques (blind/double-blind/deliberative/random) and compare that with how the study has been reported in the media. In this interactive class (that contains a lot of thinking about ice-cream), the students have come up against some critical data literacy issues.

2. Variable to data

It is worth talking about what data is likely to be available and what isn’t. If your government has an open data site like data.gov.au encourage the class to explore it. Talk about privacy -- they won’t be able to access individual medical records, but a lot of health data is available. Talk about transparency, freedom of information, and secrecy. Talk about the dark web and all the information locked behind passwords and hidden in unsearchable images. Talk about the harsh reality that sometimes you can’t get the data that you want and need to move on to the next best option.

Following this discussion, send them data hunting. Fact-checking is a great way to get quickly into it. Australian political discourse is so chock-full of talk about mining and mining jobs that when I ask a class what proportion of the Australian workforce they think works in mining, some guess as much as 50%. How can we check that? I send them on a treasure hunt -- searching in pairs for the most current data they can find. Job statistics are on the Australian Bureau of Statistics website in large Excel files with many tabs. I ask them to find me four numbers: total jobs for our state, mining jobs for our state, total jobs for Australia, and mining jobs for Australia. They manage this task fairly easily. The trick is for me to have found the data the night before, so that I can quickly identify if they have found it or turned up something else.

Tell them what data scraping is, even if you don’t get as far as doing it. That said, you can do a simple scrape in a two-hour class. Just do it yourself the day before and make a very step-by-step PowerPoint with lots of circles and arrows, so they can follow along and get it right. Using the free Chrome extension Open Web Scraper, we scraped our state government’s tender website and made a list in Excel of who the government had paid to do what. We cleaned the data (because sometimes dollar signs were included in amounts and sometimes they weren’t) and we sorted it. Then, they had to find four story ideas in it (For example, why was $73,115 spent on anger management training for the staff of a particular agency? Why was a million dollars spent on a new desk for a regional police station?)

A snippet of Kayt’s data scraping tutorial.
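
As a small illustration of the cleaning step described above, this is roughly how the inconsistent dollar amounts might be normalised in Python with pandas; the column names and values below are hypothetical, not taken from the actual tender site.

```python
import pandas as pd

# Hypothetical scraped tender data: the amounts are inconsistent strings
df = pd.DataFrame({
    "agency": ["Agency A", "Agency B", "Agency C"],
    "description": ["Anger management training", "New desk", "IT support"],
    "amount": ["$73,115", "1,000,000", "$8,400"],
})

# Strip dollar signs and thousands separators, then convert to numbers
df["amount"] = (
    df["amount"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Sort so the biggest payments (and potential story ideas) rise to the top
print(df.sort_values("amount", ascending=False))
```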

3. From data to descriptive statistics

Part two of the mining jobs task is to step our way through how to make that information into two pie charts that show the proportion of mining to total jobs in our state versus the nation. We use Excel and I encourage students to help each other, as well as using online percentage calculators and YouTube tutorials if they need extra help or want to check their numbers. We cover the basics: Pie charts need to add up to 100%. Did they remember to subtract the mining jobs from the total jobs before they made the charts? This low-stakes quick activity is a confidence-builder that makes something that they could very easily use in a story about mining jobs.

Starting with simple pie charts, students quickly learn how data and graphics can enhance a story.
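
For teachers who want a worked version of this activity, here is a minimal sketch in Python with matplotlib. The job counts are placeholders to be replaced with the figures students find on the ABS site; note the subtraction of mining jobs from the total before charting, the step students most often miss.

```python
import matplotlib.pyplot as plt

# Placeholder job counts -- substitute the real ABS figures
state_total, state_mining = 1_400_000, 120_000
national_total, national_mining = 13_000_000, 250_000

fig, (ax1, ax2) = plt.subplots(1, 2)

# Subtract mining jobs from the total first, so each pie adds up to 100%
ax1.pie([state_mining, state_total - state_mining],
        labels=["Mining", "All other jobs"], autopct="%1.1f%%")
ax1.set_title("Our state")

ax2.pie([national_mining, national_total - national_mining],
        labels=["Mining", "All other jobs"], autopct="%1.1f%%")
ax2.set_title("Australia")

plt.show()
```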

There are plenty of datasets available that you can challenge a class to turn into various kinds of graphs and infographics. In my experience, their rookie mistakes will include errors in considering scales and axes, an inability to control the size of fonts and labels (making some of the information unreadable), and unfamiliarity with what to write next to a graph, to introduce it to the story and allow it to do some work, without repeating its content and rendering it just an illustration of the text.

It’s also worth spending a bit of time on mean, median, and mode, along with when and why they are used in the context of journalism -- mean most of the time, median for data with crazy outliers (like house prices), and mode for categorical data (like ice-cream flavour preferences).
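
A quick sketch using Python’s built-in statistics module makes the distinction tangible; the house prices here are invented.

```python
import statistics

# Invented house prices with one crazy outlier
prices = [350_000, 420_000, 450_000, 480_000, 5_200_000]

print(statistics.mean(prices))    # 1,380,000 -- dragged up by the outlier
print(statistics.median(prices))  # 450,000 -- a fairer 'typical' price

# Mode suits categorical data, like ice-cream flavour preferences
flavours = ["chocolate", "vanilla", "chocolate", "salted caramel"]
print(statistics.mode(flavours))  # chocolate
```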

At this stage, if you are not feeling like a fully-fledged maths teacher, it’s wise to embrace blended learning and let online tutorials do some of the explaining for you. There are plenty available and it’s a good way to balance out a range of skill levels in a classroom. You can also add an extra incentive by offering a couple of marks for students who send you a completion certificate from an online course (such as the ones from Lynda.com/LinkedIn Learning). This strategy supports students who need to start with basic courses, while more advanced students can level up.

If you have time, pivot tables are fun and fairly easy to teach. What you need is a big dataset, links to a couple of good YouTube explainers and a set of questions that can be answered by sorting the data in various ways. I have done this a few times in a two-hour class, encouraging students who find it easy to help others, and it has worked fine; they all get it.

(Hint: If your university is likely to have issues with a whole class downloading a dataset all at once it can be worth having it on a USB stick, or preloading it onto the class computers.)
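
If you want students to try the same exercise outside Excel, a minimal pandas equivalent looks like this, with a small hypothetical payments dataset standing in for the big downloaded one.

```python
import pandas as pd

# Hypothetical payments data -- in class this would be the downloaded dataset
df = pd.DataFrame({
    "agency": ["Health", "Police", "Health", "Education", "Police"],
    "supplier": ["Acme", "DeskCo", "Acme", "BookCorp", "TrainerCo"],
    "amount": [12_000, 950_000, 8_000, 40_000, 73_115],
})

# A pivot table answering: how much did each agency spend in total?
pivot = df.pivot_table(index="agency", values="amount", aggfunc="sum")
print(pivot.sort_values("amount", ascending=False))
```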

4. From descriptive to analytic statistics

Given that the WJEC data journalism group’s recommendation was to “…[t]each a foundational understanding of numeracy and quantitative data, sufficient to confidently interpret numbers and avoid errors…”, it is important to talk about inferential statistics. But it’s also important to remember that no discipline manages to teach this kind of maths quickly -- entire units are devoted to it in psychology and the sciences. What can be taught in a compressed format, though, are some of the key concepts, such as hypothesis testing, causation vs. correlation, assumptions about normal distributions, deviation and margins of error, and significance/confidence. Ben Goldacre’s Bad Science TED talks are a good launching point for some of these discussions. By all means, encourage students to explore R and SPSS, but they are big missions for a crowded undergraduate programme, and understanding core concepts is important groundwork for the use of those programmes anyway.

Conclusion

I hope these tips will encourage experimentation with introducing small data journalism activities into journalism units, from first year to final year, and with disadvantaged and math-averse students, as well as with the accomplished and confident. The key learning I have taken from a few years of working on this challenge is that nothing is more important than seeing your students for who they are, and offering, in a non-judgemental way, to help them make a change.

For more on teaching and overcoming math anxiety, check out:

]]>
Geographic information systems: a use case for journalists https://datajournalism.com/read/longreads/geographic-information-systems-a-use-case-for-journalists Thu, 28 Nov 2019 10:30:00 +0100 Jacques Marcoux https://datajournalism.com/read/longreads/geographic-information-systems-a-use-case-for-journalists The only path to becoming a successful data journalist is to commit oneself to a lifestyle of continuous self-learning. This is simply the price of admission to specialise in this ever-evolving field of journalism.

At every turn, we are confronted with steep learning curves that require us to decide where we will invest our limited time and mental energy. These decisions are often made on the basis of the journalist’s perceived return on investment:

  • Python or R (and then which libraries?)
  • JavaScript or off-the-shelf visualisation tools?
  • PostgreSQL, MySQL, or SQLite (do I even need databases?)

While grappling with those tough questions will inevitably remain a rite of passage, let me propose at least one learning trajectory with guaranteed journalistic returns: proficiency with geographic information systems (GIS).

What is GIS?

Speak with a few GIS professionals and a common theme will emerge: they struggle to explain to their loved ones exactly what it is they do.

Many of us understand superficially that GIS has something to do with ‘mapping’ and ‘geography’, but this is just the tip of the iceberg.

Similar to how tools such as spreadsheets or databases are used to manipulate, summarise, query, edit, and visualise information, GIS allows the same operations to take place -- but with the addition of a spatial dimension, connecting your data to a location in space.

For example, if you had a database of all homes built in your community, it might contain details about each house’s features, such as the year it was built, the number of floors, the total living space, the value of the property, when a building permit was last issued, and much more. With this information you could derive all kinds of interesting insights about the makeup of homes in your community.

By adding geocoded home addresses to this database, you would now have the ability to evaluate these homes based on their physical location relative to one another, on their density in certain areas, as well as their proximity to certain landmarks, such as a landfill or a train station. This is GIS in its simplest form.
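
In code, that ‘simplest form’ might look like the sketch below, using geopandas, one of the Python spatial packages mentioned later in this piece. The file name, column name, and landfill coordinates are all hypothetical.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical file of geocoded homes, with attributes such as year built
homes = gpd.read_file("homes.geojson")

# Work in a projected CRS so distances come back in metres, not degrees
# (EPSG:32614 is a UTM zone; pick the one appropriate to your city)
homes = homes.to_crs(epsg=32614)

# A hypothetical landfill location, reprojected to match
landfill = (
    gpd.GeoSeries([Point(-97.14, 49.90)], crs="EPSG:4326")
    .to_crs(epsg=32614)
    .iloc[0]
)

# The spatial dimension at work: how many homes sit within 2 km of it?
homes["dist_to_landfill"] = homes.distance(landfill)
print((homes["dist_to_landfill"] < 2_000).sum())
```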

GIS touches every aspect of our lives

GIS technology and concepts are all around us and have real-world consequences. The following are just a few examples that are of great public interest:

  • emergency services dispatching
  • forestry management
  • traffic and public transportation management
  • flood forecasting and climatology
  • housing development
  • epidemiology and public health
  • online food order and ridesharing services
  • mail and parcel delivery services.

Any journalist hoping to closely scrutinise policy decisions emanating from these areas would be well served by learning the same tools and concepts that drive many of those very decisions.

This is GIS-driven journalism in response to the rise of GIS in society.

This is no different than a traditional political reporter learning basic accounting principles in order to make sense of government budgets and annual reports.

The good news is: Many data journalists have already embraced the use of spatial data and mapping in their storytelling. A 2017 study by researchers at the University of Hamburg found that maps were used by half of the 225 projects nominated for the Global Editors Network’s Data Journalism Awards between 2013 and 2016.

While collectively we are making use of maps as a powerful visualisation tool, my observation has been that many data journalists are missing out on some key opportunities to uncover additional insight within their data, especially spatial data.

Cartography vs. GIS

A significant number of maps used in media publications would more appropriately be classified as a form of cartography -- that is, mapmaking for the purpose of providing some form of geographic context through graphic visualisation.

Most fledgling data journalists have at some point in their career succumbed to the irresistible urge to interactively plot any spatial data they could get their hands on, often using the soon-to-be-fully-deprecated Google Fusion Tables.

This was especially true at a time before the open data ethos took root in many public institutions, and spatial data was often closely guarded by internal gatekeepers. The novelty of having that map file in hand after months of freedom of information requests felt like justification enough to publish a map.

This author is guilty as charged.

The following map illustrates (albeit in an extreme way) the limited usefulness of simply representing information on a map:

At the core of the issue is the distinction between cartography, which is largely about representing data graphically, and GIS, which seeks to analyse the spatial relationship between elements on a map.

Modern day roadmap: a Victorian era case-study

Widely considered one of the first examples of modern epidemiology, English physician John Snow’s geographic tracking of an 1854 cholera outbreak in London is a textbook example of insight through GIS.

In light of a mounting death toll in a specific neighbourhood in London’s West End, John Snow embarked on a study that involved mapping the home residences of all persons who died from cholera infections. His review of the resulting data showed a tight clustering of fatalities around a single water source known infamously as the ‘Broad Street Pump’. It would later be discovered that the water source leading up to this public street-level supply was contaminated by raw sewage.

As put by former news and data editor for the Guardian, Simon Rogers, John Snow’s study and reasoning gave data journalists a 'working model' for how to approach their craft.

Consider this: If John Snow were alive today working as a public health researcher, the same analysis would have been done using a computer-based GIS application.

The original map from John Snow’s study. Source: Archive.org, page 44.

He likely would have also had unfettered access to municipal spatial files for the entire underground water and sewer line network, along with their maintenance records, exact pump locations, water consumption data, water quality testing results throughout the network, population density for each neighbourhood, and, finally, coroner or medical examiner reports on the cholera-related fatalities.

As it turns out, today’s data journalists could probably access most of those records as well.

Think of the possibilities.

Questions waiting to be answered by you

So, what are some concrete examples of how GIS can enhance your journalism?

These more advanced examples are just a few among many great ones, but the key is that they all look for patterns, outliers, and the connections between data.

An overview of crowd sizes from Reuters' piece on the 2019 Hong Kong protests, which leverages GIS.

Getting started with GIS: key concepts

1. GIS software

At some point, early in your exploration of spatial data and mapping, you’ll run into a situation where you’ll need to either convert a file type, modify the projection (more on this later), add attribute data, or make edits to a boundary.

For many journalists this represents the initial foray into QGIS, which is a free and open-source GUI desktop application. QGIS supports everything that a beginner would need; it also satisfies most of the needs of advanced users.

The other tool often used in newsrooms (and nearly always used by GIS professionals world-wide) is ArcGIS, a commercial software package. ArcGIS has some functionality that QGIS lacks, but because of the nature of the open-source community, plugins for enhanced features in QGIS are often available to help narrow that utility gap.

A good starting point is to download QGIS and follow along with their A Gentle Introduction to GIS tutorial.

Depending on your level of knowledge and programming skills, you can also explore how to perform GIS analyses through code, using spatial packages for Python or R. I would first recommend getting familiar with GIS concepts using a desktop application, however.

You may reach a point where you’ll find QGIS, Python, or R cannot efficiently process a high enough volume of data. In these situations, many analysts opt for a more powerful tool such as PostGIS, the popular spatial extension for the PostgreSQL database, which essentially stores your spatial data inside a database and allows the user to query these records using a series of SQL-esque functions. But this falls far beyond the scope of this Long Read.

2. Spatial files types

Because most of us initially learned to visualise maps using Google-based applications, Keyhole Markup Language (KML) files were our first exposure to spatial data. This is the default filetype for Google Earth, Fusion Tables, and other Google mapping tools.

KML files are text-based and resemble XML or HTML structures. You may also encounter KMZ files which are simply compressed KML files that have been zipped to reduce storage size.

As you progress in your GIS learning, the next file type you will likely encounter is a Shapefile.

The shapefile format can spatially describe vector features such as points, lines, and polygons, as seen in the above vector map. Credit: Wikimedia.

The ubiquitous Shapefile is actually a collection of files that are nearly always distributed in a single zipped folder. The key thing to know about this file type is that each file in the bundle -- some of which are mandatory, others optional -- serves a unique purpose.

The file with the extension ‘shp’ contains the information that draws the points, lines, or polygons on the maps. The ‘shx’ file contains indexing information which helps speed up processing times. The ‘dbf’ file contains all of the attributes about each element. These three files are required, otherwise your Shapefile will not function properly.

Another common file contained in this bundle (but not required) is the ‘prj’ file, which specifies the projection to be used when the file is loaded (more on this in the next section).

Lastly, spatial files are increasingly being made available in GeoJSON formats. You may initially find this format confusing, but it is a highly efficient way of storing spatial data. GeoJSON files can be parsed natively by JavaScript, which makes it ideal for many custom interactives.
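
One practical upshot is that a library like geopandas can read most of these formats through a single interface, so a rough sketch for loading and converting between them (with hypothetical file names) is only a few lines. KML support depends on the drivers bundled with your installation.

```python
import geopandas as gpd

# Shapefiles can be read directly from the zipped bundle
districts = gpd.read_file("districts.zip")

# GeoJSON is read the same way
pumps = gpd.read_file("pumps.geojson")

# KML may require enabling the right driver in your installation
# outbreaks = gpd.read_file("outbreaks.kml")

# Converting between formats is a one-liner
districts.to_file("districts.geojson", driver="GeoJSON")
```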

One thing to note with spatial data is that it comes in two distinct flavours: vector and raster.

In most journalistic applications, vector files are used; however, it’s important to be aware that many other industries make use of raster files. Raster data often comes in the form of satellite imagery or aerial photographs, where the value given to each cell or pixel is the data itself (for example, a specific shade of green for a certain pixel in a satellite image could represent a type of vegetation). These types of data are frequently used in forestry and natural resources management.

3. Coordinate reference systems and projections

If you want to save yourself hours of frustration and troubleshooting, pay close attention to this section as it is foundational to GIS.

From my personal experience and from assisting other reporters over the years, a lack of clear understanding of how projections and coordinate reference systems work is the cause of nearly all errors for beginners and intermediate users alike.

So, what are projections?

The concept of projections comes from the fact that there is no perfect way to represent the surface of a sphere on a sheet of paper (or a computer monitor for that matter). To illustrate this point: take an orange and, after removing the peel, try to lay the skin flat on a table. See the problem?

Over the years, cartographers have come up with many different methods for overcoming some of these limitations, but none are perfect. These varying approaches for displaying the world on a flat surface, known as ‘projections’, come in different class families, and there are close to 6,000 unique ones for applications of all types.

Various map projections found in the QGIS documentation.

Coordinate reference systems provide the frameworks for defining real-world locations. They come in two types: geographic coordinate reference systems and projected coordinate reference systems.

When you began working with maps, say on a Google platform, it’s likely that you simply uploaded your KML file or geocoded a series of latitude and longitude coordinates, and then proceeded to visualise them on a web mapping service, never considering there was a very specific coordinate reference system being assigned by default.

What you probably didn’t realise at the time was that you were likely working with a geographic coordinate reference system known as WGS84. This is the standard for most GPS devices and many online mapping services. Sometimes this coordinate reference system is represented as EPSG:4326, its identifier in the EPSG registry of coordinate reference systems. You will often hear people refer to WGS84 as a ‘projection’ and, while this is technically incorrect, it is often acceptable to refer to it as such when speaking in general terms.

A key thing to remember is that when you are working with coordinates in decimal degrees, the units of measurement are in degrees. Hold this thought for now.

With projected coordinate reference systems, rather than working with angles (decimal degrees) on a sphere, you are nearly always working with coordinates on a two-dimensional plane with an X (longitude) and Y (latitude) axis. The unit of measurement can be metres, kilometres, feet, miles, and so on.

Depending on the purpose and especially the location of your work, it's important to select an appropriate projection and to recognise their limitations.

When you observe a traditional Mercator world map, the sizes of countries closer to the poles are exaggerated, while those closer to the equator are minimised.

There is no better way to illustrate this point than with the online tool The True Size…, which allows a user to click and drag countries over the top of each other in order to compare their actual surface areas. This reinforces the fact that every map projection introduces some form of distortion.

A geographic comparison of Greenland's actual size from thetruesize.com.

As the authors of the website point out: “Greenland appears to be roughly the same size as Africa. In reality, Greenland is 0.8 million sq. miles and Africa is 11.6 million sq. miles, nearly 14 and a half times larger".

I highly recommend watching this exceptional explainer video on map projections to better understand this concept and how it warps our sense of reality.

Why does all of this matter?

First, if you intend to measure distances between cities, add a buffer zone to a contaminated site, or calculate the surface area of an electoral district, the accuracy of your measurements could be jeopardised if you don’t select a coordinate reference system suited to your task. Be mindful of the scale and extent of your data. If your data spans only a city, your coordinate reference system should be different than for spatial data that spans an entire continent.

Secondly, remember how the coordinate reference system based on latitudes and longitudes uses degrees for its unit of measurement? It’s likely that you will want to be working with metres or kilometres for your project. In this case, you’ll need to convert your vector layers to a projection that uses your desired units.

Finally -- and this is extremely important -- if you intend to study the relationship between two spatial files, you must first make sure that both have matching projections. If you keep getting an error while trying new tools, this should always be the very first thing you verify. Best practice is to convert all of your spatial data to the same projection before you begin.
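
In code, checking and matching projections before any analysis is short work. Here is a minimal geopandas sketch; the file names are hypothetical, and EPSG:32614 stands in for whatever projected CRS suits your area.

```python
import geopandas as gpd

sites = gpd.read_file("contaminated_sites.geojson")  # often WGS84 (EPSG:4326)
parcels = gpd.read_file("property_parcels.zip")

# Always inspect the coordinate reference systems before analysing
print(sites.crs, parcels.crs)

# Convert both layers to the same projected CRS with metres as its unit
sites = sites.to_crs(epsg=32614)
parcels = parcels.to_crs(epsg=32614)

# Measurements now come back in metres (or square metres, for areas)
parcels["area_m2"] = parcels.geometry.area
```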

Read more about projections from the QGIS documentation.

A hypothetical walk-through: GIS in daily news

The following is a hypothetical newsworthy scenario paired with a walkthrough of potential GIS applications. Note that the suggested documentation all assumes the user is working with QGIS as their software of choice. This is not intended to be a step-by-step tutorial, but rather a high-level example of the mechanics of leveraging GIS tools for original news content.

1. Creating buffers

Scenario:

There has been a train derailment in your community and authorities say a chemical spill could have adverse health effects for residents within a 1,500 metre radius.

GIS tool to use:

After geocoding the address (or finding the latitude and longitude) of the derailment location, create a new vector layer in QGIS. As always, convert the layer to an appropriate projected coordinate reference system (one that uses metres as its base unit).

Then, use the geoprocessing tool called Buffer to expand the point representing the accident location into a circular boundary with a radius of 1,500 metres. This will result in the creation of a new layer containing a single polygon.
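If you prefer to script this step rather than use the QGIS interface, here is a minimal sketch in geopandas; the coordinates and the UTM zone (EPSG:32617) are hypothetical stand-ins.

    import geopandas as gpd
    from shapely.geometry import Point

    # Hypothetical derailment site, already expressed in a metre-based CRS
    spill = gpd.GeoDataFrame(geometry=[Point(355000, 4825000)], crs="EPSG:32617")

    # A 1,500-metre circular danger zone around the point
    danger_zone = gpd.GeoDataFrame(geometry=spill.buffer(1500), crs=spill.crs)
    danger_zone.to_file("danger_zone.geojson")  # the new single-polygon layer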

The diamond represents the location of the derailment and chemical spill along a stretch of rail. Using the Buffer tool, a new layer representing a 1,500-metre buffer around the rail accident location is created.

2. Centroids

Scenario:

Continuing with the previous scenario, in addition to visualising the danger zone on a map for your readers, you may also want to report on the number of homes that fall within this buffer zone. This is often a two-step process.

GIS tool to use:

From the previous step, you have a layer that contains a polygon representing the danger zone for the chemical spill. If you have access to it, get a copy of a municipal property parcel spatial file. As I’ve mentioned, the first step is to make sure the coordinate reference system of this new file matches the one employed by your buffer zone layer.

After importing the municipal property spatial data, all the non-residential properties were filtered out using the QGIS query builder.

From here, you will want to filter out all properties that are not zoned as residential, because you’re ultimately interested in areas where people are most likely to live, as opposed to parks or industrial areas. Once you have done so, the next step is to generate a new layer representing the centroids of all those residential parcels. A centroid is an algorithm-generated point that represents the centre of a polygon.

To achieve this, use the vector geometry tool called Centroids. This process will output a new layer with a series of points.
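The same step, scripted: a minimal geopandas sketch that assumes a hypothetical parcels file with a ‘zoning’ column.

    import geopandas as gpd

    # Hypothetical municipal parcel file, converted to the buffer layer's CRS
    parcels = gpd.read_file("parcels.gpkg").to_crs("EPSG:32617")

    # Keep only residential parcels, then reduce each polygon to its centroid
    residential = parcels[parcels["zoning"] == "residential"]
    centroids = gpd.GeoDataFrame(residential.drop(columns="geometry"),
                                 geometry=residential.centroid,
                                 crs=residential.crs)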

Using the Centroid tool, a new layer was created representing the centroid of each residential property parcel. The question now is: how many homes are in the buffer area?

3. Points in polygon analysis

Scenario:

Once you have your centroids layer (double-check that its coordinate reference system matches your buffer layer), you want to perform an analysis to get the total number of points (that is, residential properties) that fall within your buffer layer.

GIS tool to use:

While QGIS has a convenient one-click function intuitively called Count points in polygon, under the hood this tool is actually testing the spatial relationship between each centroid derived from your property data and the polygon you created using the Buffer tool. Using an intersects operation, the function returns TRUE or FALSE for each centroid, ultimately providing you with a total count of all the TRUEs.

To get this final result, select the Count points in polygon analysis tool. After running this process, a new polygon layer will be generated with the same content as the buffer zone layer, plus an additional attribute field containing the point count. This value is the number of homes in the danger zone that you will report in your story.
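The same test is easy to express in code. A minimal sketch, continuing with the hypothetical layers from the previous steps:

    # Test each centroid against the danger-zone polygon (a within/intersects check)
    zone = danger_zone.geometry.iloc[0]
    inside = centroids[centroids.within(zone)]
    print(f"Homes in the danger zone: {len(inside)}")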

The points-in-polygon tool enables you to quickly calculate the total number of centroids (that is, homes) that overlap with the buffer area. The answer can be found in the new column called “NUMPOINTS” in the attribute table of the resulting layer.

From here, you can take your story a step further and use a nearest-neighbour analysis to identify the addresses of the 100 homes closest to the chemical spill. With this information in hand, you can elevate the impact of your story by including the voices of those most affected by the disaster.
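With both layers in a metre-based coordinate reference system, a simple distance sort gets you most of the way there; a sketch under the same hypothetical set-up:

    # Distance from every home to the spill site, then take the 100 nearest
    homes = centroids.copy()
    homes["distance_m"] = homes.distance(spill.geometry.iloc[0])
    closest_100 = homes.nsmallest(100, "distance_m")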

Conclusion

In terms of the overall potential that GIS brings to your newsroom, this use case is merely the tip of the iceberg. For example, a journalist with more advanced GIS skills can perform vehicle routing analyses to shed light on emergency response times across their community, or even help to identify hotspots or categories of crimes using a clustering algorithm.

There are endless resources online to advance your learning journey, including the official documentation for your tool of choice, YouTube tutorials, and forums or websites such as Medium, Stack Overflow, GitHub, Reddit, and more. My experience has been that the GIS community at large is very welcoming of journalists seeking to immerse themselves in this field of study; reaching out to these professionals to serve as informal mentors can give you a confidence boost when you get stuck or are unsure your work is accurate.

So...

...Are you convinced your next learning curve should be diving into the world of GIS? Let us know by commenting below.

The dos and don'ts of predictive journalism https://datajournalism.com/read/longreads/the-dos-and-donts-of-predictive-journalism Thu, 14 Nov 2019 09:29:00 +0100 G. Elliott Morris https://datajournalism.com/read/longreads/the-dos-and-donts-of-predictive-journalism The author Michael Lewis writes in The Undoing Project, his best-selling biography of Nobel prize winner Daniel Kahneman and Amos Tversky, that “knowledge is prediction”. When we assert that something is true, Michael Lewis argues, we are drawing on the evidence we’ve stored in our brains to anticipate the reality of a fact, event, or situation. We combine that evidence to predict (with varying degrees of certainty) that a particular phenomenon can or cannot be. To know something is to predict that it’s true -- and vice versa; to predict something is to know what makes it true.

Journalists have adopted some variation of this axiom and, in recent years, embraced predictive journalism with open arms. Most major media outlets now have dedicated data journalism teams to model data and create stories driven by prediction.

Primarily, we turn to prediction so that we can know something empirically and convey it to our readers. Social science has also given journalists the tools to establish credibility independently, where previously they had to rely on others. Modelling data also lends itself to data visualisation, a useful journalistic tool in the age of digital media. It makes sense that predictive journalism has gone mainstream. Outlets now forecast everything from the Oscars and elections, to house prices and coups d'état.

To know something is to predict that it’s true

Journalism’s increasingly empirical bent is for the better; once journalists know what makes something tick, they can tell readers why.

But forecasting is complicated. The chance that a prediction can go awry is higher than a lot of people realise. Bad data, poor modelling, and insufficient communication are all real threats to even seasoned journalists.

Minimising the risk of bad prediction is not easy. Even the best forecasting models, which may combine hundreds of variables with a variety of statistical methods, still often come up short. Modellers run into several fundamental issues with prediction: the past is not always predictive of the future (a common quip), some outcomes are intrinsically unpredictable, and subtle differences in methodological choices can lead to big differences in results. And then there’s the rather tall task of communicating results -- including our predictions and our confidence in them -- to readers.

To predict something is to know what makes it true

It’s clear that journalists thus face significant obstacles to making good forecasts. This Long Read assists the evidence-based reporter in making sense of the promises and perils of predictive journalism. We will learn a bit about prediction, a bit about statistics, and a bit about journalism.

To start, we will outline a philosophy of good prediction. We will applaud some methodological choices and decry others; only a fair accounting of forecasting in journalism can properly prepare us to do good, not harm, with the massively powerful tools that are available to scientists today.

Then, we will see how good predictive journalism draws on three factors: first, an adequately-specified statistical model; second, communication of the results and the underlying (un)certainty of the model; and third, hand-holding for the reader as they make sense of forecasting wonkery. But before we get there, let’s put ourselves in the correct frame of mind for making helpful predictions.

Thinking about prediction

On 7 September 2007, Mark Zandi, chief economist at Moody's Analytics and co-founder of Economy.com, now a subsidiary of Moody’s, told CNN he was dismissive of the threat of job loss and a harsh economic recession. “I don’t think consumer spending will fall unless the job market is contracting,” Zandi said, “and I’m fundamentally optimistic we won’t see job loss”. Three months later, the global financial system entered its worst downturn since the Great Depression. How did Zandi -- and the many others who agreed with his analysis -- get it so wrong?

In his 2012 book, The Signal and The Noise, Nate Silver, the founder and CEO of data journalism website FiveThirtyEight, wrote that financial analysts made a fundamental mistake in analysing the risks of an economic crisis. Nate wrote that Standard & Poor’s (S&P), a financial company that publishes the S&P 500 stock index, based their analysis of the risks of popular mortgage bundles called Collateralized Debt Obligations (CDOs) on the chance of individual mortgage defaults. The risk of a CDO was calculated as the risk that any one consumer would default on their loan independently of whether someone else also defaulted.

The industry thus missed the chance that if one person defaulted on their house payments, others in the CDO were likely to default as well, especially if everyone was facing similar effects of a contracting economy. In statistical terms, the firms failed to estimate the risks of correlated outcomes. According to Nate Silver, Standard & Poor’s was estimating the chance of default for mortgages within CDOs at 0.12%. In reality, they had a 28% default rate.
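A toy simulation makes the point concrete. The sketch below is emphatically not S&P’s model; it is a minimal one-factor simulation with made-up numbers, showing how a shared economic shock inflates the chance that many mortgages in a pool default together.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(7)
    p_default = 0.05              # assumed per-mortgage default probability
    n_pool, n_sims = 10, 200_000  # 10 mortgages per pool, 200,000 simulated pools
    cutoff = norm.ppf(p_default)

    def pool_bust_rate(rho):
        """Share of pools in which at least half the mortgages default."""
        shock = rng.standard_normal((n_sims, 1))      # shared economic shock
        own = rng.standard_normal((n_sims, n_pool))   # borrower-specific risk
        latent = np.sqrt(rho) * shock + np.sqrt(1 - rho) * own
        return ((latent < cutoff).sum(axis=1) >= n_pool // 2).mean()

    print(f"independent defaults (rho=0.0): {pool_bust_rate(0.0):.5f}")
    print(f"correlated defaults  (rho=0.5): {pool_bust_rate(0.5):.5f}")

With no shared factor the pool almost never busts; with a strong one, it busts orders of magnitude more often, even though each individual mortgage is exactly as risky as before.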

Good forecasters are forward-thinking and open-minded

Journalists can learn two things from the forecasting errors made in the run-up to the financial crisis. First, it is clear that history is not always the best guide for the future. According to a 2004 paper from the Bank for International Settlements, the correlation structure of Standard & Poor’s CDOs was based “on historically observed defaults” of American mortgages. But since the history of those mortgages did not take into account the chance of a systematic increase in rates of default, they underestimated the total risk of financial collapses. Just because there hadn’t been a great recession in their training data did not mean there would not be one in the future.

Second, the financial crisis teaches empirical journalists to take the risks of overconfidence into account when making their predictions. If a model tells you that an event has a 0.1% chance of happening, you should ask yourself what the consequences would be if your model were wrong, and how those consequences relate to the stated probability. In the case of the financial crisis, most investors likely did not consider that the risks of buying CDOs could be miscalculated -- and at a stated risk of roughly 0.12%, worrying would indeed have seemed silly. But if S&P had told them that their CDOs would bust nearly three times out of ten (or 28%), they likely would have thought again about investing their money.

The average person -- and reader -- thinks the same way. Told that the chance of rain is 0.12%, I would never bring an umbrella with me to work. But if there were a nearly 1-in-3 chance of a downpour, I would consider putting one in my bag. “Just in case”, we might mutter to ourselves on our way out the door. But investors were given no reason to say “just in case” in 2007.

Good forecasters know their problem

Good predictions also cover all the bases of a problem. They are created by people with deep domain knowledge on the subject in question; when we attempt to answer a problem with a blind reading of the data, we may miss something crucial.

Take, for example, differences in the predictions made for the Academy Awards by TIME Magazine and those made by FiveThirtyEight.com. In 2019, TIME data editor Chris Wilson and University of Virginia statistics professor Christopher Franck declared that Roma, a moving film about a domestic worker in an affluent Mexico City neighbourhood, was favoured to win Best Picture (it had a 46% chance, they claimed). But Roma did not win; instead, Green Book, a film about an African American pianist touring the 1960s American south, did, despite their model giving it just a 1.7% chance of victory. In their piece, the authors said that neither of them had even watched Roma. They simply turned the data over to a statistical model that plucked out winners and losers with some degree of confidence.

Chris Wilson and Christopher Franck might say that the Oscars are inherently unpredictable. But would they have created a different model if they had studied film and award shows for a living?

The Academy’s decisions take into account all sorts of variables, like the racial composition of winners and losers and the public’s demands for certain stars to win the award, and the algorithm for determining winners is opaque and complex. Yet, the fact remains that Walt Hickey, once a culture writer for FiveThirtyEight and now senior editor for data at INSIDER, devised a model in 2018 that correctly picked the winner for each of the top seven Oscar award categories. Walt nailed his predictions for best actor and actress, best animated feature, best supporting actor and actress, best director and best picture. (He ‘missed’ Icarus, the year’s best documentary, though he correctly identified the race as a toss-up). Walt may have obtained such a record because he spent the past decade predicting winners of the award. He knows which other film awards best correlate with winning the Oscar -- and perhaps more importantly, he knows why. Walt combined his own knowledge with his model to get the best predictive results.

A glimpse into Walt Hickey's model for 'best director'.

Good forecasters know their tools

It is important to note that the tension between domain knowledge and data can cause problems for effective inference; knowing your statistical tools is equally important. But in scenarios where your qualitative analysis tells you not to trust your data, you should not default to abandoning prediction entirely. On the contrary, domain knowledge includes the context of both the data your model relies upon and the data it does not. Often, the inputs to our models are an imperfect match for the problem we’re solving, but there is nevertheless no better alternative. Embracing other data just because our primary observations are imperfect can cause problems for inference. To see what I mean, let’s explore two prominent predictions made about the 2017 presidential election in France.

In 2017, Ian Bremmer, the president of a risk consultancy called the Eurasia Group, threw quite a tizzy when members of The Economist’s data team predicted that Marine Le Pen, the leader of France’s populist/nationalist party, the National Front, had less than a 1% chance of becoming the country’s next president. Ian wrote that the prognostication was the paper’s “biggest mistake in decades”. He asserted that Le Pen actually had a 40% chance of victory -- at least 40 times the newspaper’s estimate.

How did the two forecasters arrive at such a different analysis of risk? The Economist’s prediction was based on a statistical analysis of the historical relationship between French election polls and election outcomes, along with a probabilistic study of outcomes using a simulation method called Markov chain Monte Carlo. Ian Bremmer, on the other hand, asserted that popular frustration with immigration and the European Union meant the paper’s reliance on polling was flawed, and that Marine Le Pen was much more popular than the data indicated.

In the second round of the election, the polls prevailed and Emmanuel Macron, a liberal centrist, defeated Marine Le Pen by more than thirty points. How could Ian Bremmer be so wrong?

The success of The Economist’s journalists was in trusting that polls would capture any differences in support for either candidate caused by the underlying factors that Ian Bremmer identified. The new populist movement in France did not make polls less important, as he claimed, but more important. In the end, Ian’s prediction was based on a subjective analysis of unrepresentative social media data -- and was no match for the predictive power of a probabilistic analysis of political polling.

Communicating prediction

Once we have made predictions, we must communicate them to readers. This is no easy task: Most people do not understand statistical jargon like “the proportion of variance explained is x%” or “the impact of this variable is statistically significant”. These concepts must be simplified and explained to be of any use to the layperson. Journalists can follow three main steps to ensure good communication for their predictions: first, explain your results; second, communicate your uncertainty; and third, use visualisation to show, rather than tell, your reader about the predictions.

Explain the result

Although modelling can reveal causal relationships in our data, the end result of data journalism is primarily explanatory. Our predictions are typically sandwiched between text or highlighted in graphics that explain their significance. Through the course of those explanations, a journalist should highlight at least (a) what went into the prediction and (b) what came out, if not also (c) what happened to the data in between. We will review a few examples of this now.

The New York Times’s The Upshot blog can sometimes take a hands-off approach to the data processing part of their predictions and a hands-on approach to communication. A good example of this is a 2014 story entitled How Birth Year Influences Political Views by Amanda Cox. In the piece, Amanda discusses the results of a predictive model of voter preferences put together by political scientists Yair Ghitza and Andrew Gelman. She describes the inputs and overarching formula of the model simply, but divulges enough details for the lay reader to know what’s going on:

The model, by researchers at Catalist, the Democratic data firm, and Columbia University, uses hundreds of thousands of survey responses and new statistical software to estimate how people’s preferences change at different stages of their lives. The model assumes generations of voters choose their team, Democrats or Republicans, based on their cumulative life experience — a “running tally” of events.

Then, Amanda presents the main graphic from their article:

This graphic does many things right. First, it presents both the study’s prediction and confidence interval. It is also annotated to tell the reader what to take away from it, on the off chance they don’t know how to read a graph that plots a coefficient and statistical significance over time. Amanda goes big on highlighting the explanatory part of Yair Ghitza and Andrew Gelman’s work. She even goes on to graph the predictions for how loyal different generations are to either party, conveying the real-world importance of the work.

In some scenarios, data journalists will also choose to visualise ‘what goes in’ to the model. Among those that do, FiveThirtyEight’s work really shines through. In their predictions for the 2018 mid-term elections to the US House of Representatives, Nate Silver and his colleagues list all of the predictors used in their models and their values:

And they also show how different components of their predictions are combined in their models:

Finally, they also visualised their predictions and uncertainty!

These two approaches to predictive journalism show that there isn’t one ‘best’ way to communicate our forecasts. But there are ways that are better than others. Such is the topic of our next section.

Communicate your (un)certainty

As Amanda Cox and Nate Silver et al. show, good empirical journalists will present both their predictions and the (un)certainty with which they are made. Communicating your (un)certainty avoids both leading your reader astray and the inevitable backlash journalists receive when they predict something ‘incorrectly’. As we will see, properly accounting for and communicating uncertainty may also help to avoid adverse impacts on readers’ behaviour.

Let us again revisit American election forecasting for two prescient examples of why communicating our certainty is important. But first, it is worth noting that communicating uncertainty can only come after properly measuring the error of your model.

Measuring uncertainty: the 2016 election

As contentious as it was for the general public, the 2016 presidential election may have been even more combative for professional election forecasters. Two prominent prognosticators in particular butted heads: Nate Silver and Sam Wang, a Princeton neuroscientist turned political statistician. Nate had been harping on Sam’s work publicly since at least 2014, when he wrote a long takedown of the professor’s methods:

[Wang’s] model is wrong — not necessarily because it shows Democrats ahead (ours barely shows any Republican advantage), but because it substantially underestimates the uncertainty associated with polling averages and thereby overestimates the win probabilities for candidates with small leads in the polls. [...] Wang projected a Republican gain of 51 seats in the House in 2010, but with a margin of error of just plus or minus two seats. His forecast implied that odds against Republicans picking up at least 63 seats (as they actually did) were trillions-and-trillions-to-1 against.

In 2014, according to Nate, Sam Wang underestimated the errors inherent in political polling. In 2016, he repeated the mistake. The model Sam built evaluated a range of plausible election scenarios in which Hillary Clinton’s eventual national vote share was within roughly 1.6% of her vote share in the polls. But such an error margin was far too slim. Harry Enten, writing for FiveThirtyEight.com, found the margin of error for the average of national election polls going back to 1968 was closer to 5.6% -- nearly four times as large as Sam Wang’s! Such a difference has a big impact on a model’s probabilistic estimates. Had he used Enten’s calculation, his model would have indicated that Clinton had an 84% chance of winning the presidency rather than a 99% shot.
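To see why the assumed error matters so much, here is a back-of-the-envelope sketch. It assumes, purely for illustration, a candidate leading by four points and normally distributed polling error; the two error figures echo the 1.6% and 5.6% above, but this is a simplification, not Wang’s or Enten’s actual method.

    from scipy.stats import norm

    lead = 4.0  # hypothetical national polling lead, in percentage points

    for error in (1.6, 5.6):  # assumed standard error of the polling average
        p_win = norm.cdf(lead / error)
        print(f"error of {error} points: implied win probability {p_win:.0%}")

The same lead implies near-certainty under the small error margin, but a far more modest probability under the larger, historically grounded one.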

Sam Wang’s errors show the importance of a proper calibration of uncertainty in our modelling. Once we have a proper measure of the uncertainty of both our data and model, we turn to the tough challenge of communication.

Communicating uncertainty: the 2020 election

Now that election forecasting has gone mainstream -- something that associate professor of journalism at the University of Minnesota Benjamin Toff calls the ‘Nate Silver effect’ -- plenty of people are jumping into the game. One of these new prognosticators is Rachel Bitecofer, assistant director of the Wason Center for Public Policy at Christopher Newport University. Rachel has proclaimed that Trump “will lose” re-election in 2020. As Sam Wang’s example in 2016 shows, this is misguided. But the problem is not necessarily that Rachel’s model is misspecified, as was Sam’s primary problem. Rather, the bigger issue is that she is not properly explaining the uncertainty of her predictions to readers.

The average reader would assume that Rachel Bitecofer’s language implies that Trump has no chance of winning the 2020 presidential election, but her work actually tells quite a different story. Once you take uncertainty into account, her “will lose” looks more like a “will probably lose”, which is an important distinction.

Rachel Bitecofer’s work (graphed below) shows that the Democrats are just barely favoured to win the 270 votes in the Electoral College required to win the presidency (and that they aren’t even guaranteed to win them). Her model shows that only 197 electoral votes ‘will’ be won by Democrats (those votes that are from ‘safe’ Democratic states). The other 81 predicted Democratic votes are up for grabs with varying degrees of certainty -- from what analysts call ‘lean’ or ‘likely’ states. In fact, if you multiply the implied state-level probabilities from her model by each state’s electoral votes, the resulting prediction is 300 electoral votes for Democrats -- just 30 more than the 270 they need to win the presidency. Given that the model does not perfectly explain voter behaviour -- a similar method explained only 70% of voter behaviour in the 2018 House and Senate midterm elections -- the predictions hardly say that Trump ‘will’ lose in 2020.
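That expected-value calculation is trivial to reproduce. A minimal sketch with hypothetical state probabilities and electoral vote counts (not Bitecofer’s actual figures):

    # (win_probability, electoral_votes) for a handful of hypothetical states
    states = [(0.99, 55), (0.90, 29), (0.65, 20), (0.55, 38), (0.45, 16)]

    expected_ev = sum(p * ev for p, ev in states)
    print(f"Expected electoral votes: {expected_ev:.0f}")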

Rachel Bitecofer’s work is a good example of why properly communicating the certainty of our predictions matters. The way she presents her findings obscures the actual story her method tells. But to be sure, she is not the only analyst guilty of obscuring her forecasts, especially when it comes to politics.

One problem is that Rachel Bitecofer’s electoral college map obscures the nature of the election in two ways. First, since the map is scaled by geography and not actual electoral college votes, it over-emphasises the number of electoral votes in states with smaller populations and minimises votes from high-population states. More importantly, it does not show the distribution of outcomes for her prediction. How, then, could we improve on a simple visualisation of predicted electoral college outcomes?

Again, The Upshot blog offers a good example of a possible solution. In 2016, the site’s authors also created their own model for the election. It was not extraordinarily different from the rest, but the way they communicated their findings sets a good example for properly communicating uncertainty.

By graphing the probability of possible electoral college outcomes with a histogram, The Upshot put readers up close and personal with probabilistic thinking. Although they ultimately predicted that Hillary Clinton would win the electoral college with 323 electoral votes, displaying possible outcomes with a histogram showed that other outcomes -- including a Trump victory -- were actually very plausible. I suspect that a similar visualisation of Rachel Bitecofer’s work would reveal the actual uncertainty in her forecasts.
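Producing such a histogram takes only a few lines once you simulate whole elections rather than summing expected values. A sketch reusing the hypothetical states from above; note that, for simplicity, it treats state outcomes as independent, which a real model should not.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    states = [(0.99, 55), (0.90, 29), (0.65, 20), (0.55, 38), (0.45, 16)]
    probs = np.array([p for p, _ in states])
    votes = np.array([ev for _, ev in states])

    # Simulate 10,000 elections and tally electoral votes won in each
    wins = rng.random((10_000, len(states))) < probs
    totals = (wins * votes).sum(axis=1)

    plt.hist(totals, bins=30)
    plt.xlabel("Electoral votes won")
    plt.ylabel("Simulated elections")
    plt.show()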

The Economist used similar visualisations to convey their predictions of the 2018 midterm elections to the US House. For their main graphic, they chose to represent cumulative probabilities of seat shares with a bar chart; each bar represents the chance that Democrats/Republicans would win the House by a certain number of congressional seats. This way, they showed there was a large range of outcomes -- that is, a lot of uncertainty! -- in their predictions. They simultaneously showed how unlikely it was for Republicans to maintain control of the chamber.

Why communicating uncertainty matters

The merits of properly accounting for and communicating the uncertainty of our predictions extend beyond simply the accuracy of our journalism. Some research shows that bad predictions might even impact our behaviours. In a 2018 academic paper, social scientists Sean Westwood, Solomon Messing, and Yphtach Lelkes suggest that the certainty of election forecasts could have impacts on who actually turns out to vote. They estimate that “divergence of 20% from even odds” in a game that simulates election forecasts “lowered voting by 3.4% (with a 95% confidence interval between 1.8% and 5.1%) and divergence of 40% lowered voting by 6.9% (CI: 3.6% and 10.2%)”. Although the game is not a perfect representation of reality, in elections when some states are decided by mere percentage points, even small decreases in turnout could have large impacts on outcomes.

Some predictive journalism is useful

Data journalists should be constantly considering that axiom from Michael Lewis: “knowledge is prediction”. To know is to predict. But to predict is also to know, and to predict incorrectly is to know incorrectly.

This Long Read has provided some guidance on how to conduct ourselves properly when creating and refining our predictive journalism. Good forecasters should know their subject, ensure that their predictors can measure the future as well as the past, know their tools, explain the ins and outs of their predictions and emphasise the (un)certainty with which they are making their forecasts.

Journalists might also consider that famous axiom from statistician George Box: “All models are wrong, but some are useful”. We must strive to ensure that we put our work in the latter category.

Data journalism: a guide for editors https://datajournalism.com/read/longreads/data-journalism-a-guide-for-editors Wed, 30 Oct 2019 09:56:00 +0100 Maud Beelman Jennifer LaFleur https://datajournalism.com/read/longreads/data-journalism-a-guide-for-editors The best ‘data stories’ are not obvious. They don’t hit the reader over the head with numbers, at least not initially. But the data is the very foundation on which the story is built, and it can help guide reporters to the best anecdotes or ways to illustrate their findings.

Good data editing requires an understanding of all that, along with critical thinking, project management skills, and a better-than-average understanding of the content, context, and organisation of the data.

Just as all investigations need a process that works for everyone, so do data investigations. Knowing the process can help editors ask the right questions and backstop reporters. It’s also important (and helpful to the overall story) for all team members to understand the methodology, regardless of whether they’ll be working with the data.

Good data projects have good workflows, which depend in part on backout scheduling: know your publication deadline, then work backwards through each important milestone that must be hit. Data-based investigations have more moving parts and more dependencies, so it’s vital that everyone knows the order in which the work must get done.

Getting started: guidance for journalists and their editors

Journalists, start as you would any other investigation: with initial research and reporting to identify what data and documents exist. If you get stuck, look for examples of similar reporting (these can be found in the IRE Resource Center) or search scientific and academic works to identify experts who have done similar work or share the same interest in your topic. Often such sources can help you streamline your preliminary data research.

It’s important to remember that initial story memos or pitches should include information about what data is available and how that data might help you tell the story.

Once the data is identified and acquired or assembled (more on that later), it and any notes should be somewhere the entire team can access during the reporting and editing process. The key is to keep everyone in the loop so there are no surprises. When working on the data analysis, avoid using email to track changes or make updates, which can get lost or be confusing. If you have a project management system in your newsroom, you may want to use that, but tools such as GitHub or Google Docs also work well.

If you’re an editor who doesn’t use programming, you should still make sure scripts contain comments, explaining what the code does, so that another person could follow along and understand your reporter’s data and methodology.

You also need to have an internal process for independently double-checking the data analysis. And keeping a data diary is essential to this end. For example, if you have a large enough team, one reporter could be the backstop for another. You also might ask a trusted colleague outside the organisation to be your backstop.

Data diaries can take many forms. Here’s one example.

Data diaries also come in handy for keeping track of file names, code, and syntax, and footnoting and lawyering copy. Mostly, they need to be clear enough that colleagues, including the editor, can make sense of the work.

Bulletproofing the data and its analysis

Bulletproofing a data-driven investigation begins with bulletproofing the data before starting the analysis. This is important because often the simplest problems get overlooked, such as not having all of the relevant records. Editors who know this can backstop their reporters’ practices and ensure that their stories are set up for success from the get-go.

In addition, you should always work off a copy instead of the original data, in case something bad happens while you’re doing the analysis, such as your computer dying or you accidentally introducing an error into the data. As you do your checks, always keep notes on what you found -- they will help you later when you do your analysis. Here are the checks that data journalists should conduct on every dataset, which editors can also use to backstop analyses (a short code sketch follows the list):

  • Check that you have all of the relevant records. It’s easy for an agency to accidentally miss some records, either by copying and pasting data or reusing an old query that pulls only certain records. If there is no reference for the exact correct number, use common sense. Would it be reasonable that the United States would have only 80,000 voters?
  • Make sure all locations, such as cities or counties, are included.
  • Look for inconsistencies in key fields. For example, are city names spelled the same way? It’s important because it could affect your results. You can do this check by getting a list of all unique possibilities within a given field and sorting them alphabetically.
  • Make sure that numeric fields are within valid ranges. For example, does your data include dates of birth that would make individuals too young or too old?
  • Check for missing data or blank fields. Make sure that you did not cause these problems by importing data incorrectly. Look at the file in a text editor to be sure.
  • Double-check totals or counts against summary reports from the agency.
  • Know your data. Know what every field means and how the agency uses it. Something that looks boring to you could be critical to your analysis.
  • Talk with the folks who work with the data and ask them about the checks they do.

Once you’ve checked your data, you’re ready to do your analysis. Keeping notes about what you do will be crucial at this stage. Those notes will help you write your methodology later and will help you (and your editor) vet the findings. As you go along with your analysis, be sure to regularly back up your data and use a naming convention that makes sense to you and to others who may use the data. Here are a few other tips to keep in mind as you undertake your analysis:

  • Make sure you’re using the right tool. You may need to do more than counting and sorting.
  • Check with experts from different sides of the issue about your methods and your findings.
  • Beware of lurking variables. The trend you found could be caused by an underlying variable you haven’t considered.
  • If you think you’re in over your head, call on an expert to help. Don’t guess or assume.
  • Double-check surprising results. For example, if citations spiked by 50% in one year, it could be a story or it could (more likely) be an error.

Often data ‘analyses’ are counting or summing data, but if you need to do a more complex analysis, here are some suggestions to help you figure out the best methodology:

  • Read research reports. Academic research on your topic might reveal best practices for working with your data.
  • Find an expert to vet your methodology. Many are happy to help, especially once they realise you’re interested in doing a serious analysis. When the Dallas Morning News examined jury strikes, one of the leading experts on bias in jury selection reviewed all of the reporting teams’ findings.
  • Show findings to the targets of the story. We’re not suggesting sharing your story, but you should put together a findings document or presentation that you can share with the targets of the story. This helps bulletproof your methodology by surfacing any problems that may exist (or variables you didn’t consider) before publication.
  • Duplicate your work to make sure you didn’t mess something up along the way. Don’t just rerun your original scripts; recreate them so you know they were done correctly the first time.
  • Maintain a consistent universe of cases. If you have to filter or redefine your universe, be able to explain why you isolated certain records or cases.
  • Give yourself enough time to follow through on collecting information for your database before you start writing. If you’ve built your own database, where information may need to be updated or will change after additional reporting, set a cut-off date and don’t make any more changes to the database unless the data is inaccurate or the new information will change the meaning of the story.
  • If you are doing the data entry yourself, make sure at least two people have reviewed every record, or consider hiring a data-entry firm that uses double-entry verification.

Bulletproofing the process

Editors of data investigations must ask even more questions than usual and do their own research. In much the same way that the reporter may have identified similar works of journalism or scientific studies, editors should familiarise themselves with those methodologies as well.

Also, it’s essential for editors to know and understand a database’s ‘record layout’, or more simply put, the kind of information that is contained in the data and how it is broken down and organised. If there is a ‘read me’ file that accompanies the data, which often describes known quirks or problems, it’s the editor’s responsibility as much as the reporter’s to read and understand those details.

You should discuss known or suspected problems in the data with your reporters and the whole project team. In fact, even if your reporters don’t bring them to you, ask what the problems are, because data always has problems. Listen to your reporters carefully, and have regular check-ins with them to see what’s worrying them. Look at the data yourself, or if you’re not conversant in the software, ask the reporter to give you a guided tour of the data. Don’t be shy about challenging the data, if there’s anything you don’t understand or that doesn’t pass the sniff test. Encourage creative thinking and brainstorm solutions.

Finally, have your reporters write their methodology (or a ‘white paper’ if it’s a long and complicated analysis) before they start drafting any story. Most often, methodologies (aka the ‘nerd box’) are written at the end of a project drafting process. But it’s not at all uncommon -- once a reporter has to explain all the details of how they conducted the analysis -- that the story language needs to change. Once a detailed methodology is written, it’s even possible that you will find some misunderstandings among the team over what was done and how. It’s better to surface these issues before the writing begins.

A methodology story by the Associated Press.

Here are a couple of examples of handling methodology in copy. Aside from including a few paragraphs in your story, which is how simple methodologies can be handled, you can write a separate short story on what you did. You can also produce a more detailed white paper, which allows you to go into great detail on a complex analysis and can have the effect of creating greater confidence and transparency around your work.

To sum up, here are 10 questions every editor should ask:

  1. Does the data answer our questions? Does it surface other questions?
  2. Where did you find the data?
  3. How did you vet and clean the data?
  4. How did you calculate those numbers?
  5. Are you keeping a data diary?
  6. Did you replicate your data work? Could someone else?
  7. Have you consulted experts or done a scientific literature review?
  8. Do we need a white paper?
  9. Could you write a nerd graf/story if asked to?
  10. What is the significance of the data? (Don’t confuse effort with importance.)

Writing the data story

As we mentioned at the start, the best data stories are not data heavy. They don’t ask readers to ‘do the math’, and they don’t subject the narrative to a lot of numbers. They tell the story or stories that the data has surfaced, through interesting characters or circumstances. As you guide your reporters through the writing phase, consider some of the below examples.

Take, for instance, this Associated Press story that was part of a global investigation into medical implants, led in 2018 by the International Consortium of Investigative Journalists.

Not until the sixth paragraph do the writers introduce the idea that the topic of the story -- spinal cord stimulators -- is being examined because the devices rank among the top of those causing patient harm. In fact, the sixth and seventh paragraphs help form the nut grafs of the story: often where you find data first appearing in stories that take a narrative approach:

“But the stimulators — devices that use electrical currents to block pain signals before they reach the brain — are more dangerous than many patients know, an Associated Press investigation found. They account for the third-highest number of medical device injury reports to the U.S. Food and Drug Administration, with more than 80,000 incidents flagged since 2008.

Patients report that they have been shocked or burned or have suffered spinal-cord nerve damage ranging from muscle weakness to paraplegia, FDA data shows. Among the 4,000 types of devices tracked by the FDA, only metal hip replacements and insulin pumps have logged more injury reports.”

This is how The Philadelphia Inquirer started its story on the findings of a year-long investigation into how children in public schools were suffering from environmental poisoning:

“Day after day last September, toxic lead paint chips fluttered from the ceiling of a first-grade classroom and landed on the desk of 6-year-old Dean Pagan.

Dean didn’t want his desk to look messy. But he feared that if he got up to toss the paint slivers in the trash, he’d get in trouble.

So he put them in his mouth. And swallowed them.”

There’s no indication in the opening paragraphs that these findings are the result of a data analysis until the eighth and ninth paragraphs:

“As part of its “Toxic City” series, the Inquirer and Daily News investigated the physical conditions at district-run schools. Reporters examined five years of internal maintenance logs and building records, and interviewed 120 teachers, nurses, parents, students, and experts.

When the newspapers analyzed the district records, they identified more than 9,000 environmental problems since September 2015. They reveal filthy schools and unsafe conditions — mold, deteriorated asbestos, and acres of flaking and peeling paint likely containing lead — that put children at risk.”

This is not to say that using the findings of a data analysis in your lead is always a bad idea. Each story, including its methodology and findings, needs to dictate the best approach to follow. Consider these examples:

The Post and Courier in South Carolina won a Pulitzer Prize for its 2014 domestic violence investigation, in which the data analysis was the lead because the numbers were so startling:

“More than 300 women were shot, stabbed, strangled, beaten, bludgeoned or burned to death over the past decade by men in South Carolina, dying at a rate of one every 12 days while the state does little to stem the carnage from domestic abuse.”

Compared to this narrative-based example from ESPN, on what ends up in some US stadium foods:

“Most Cracker Jack boxes come with a surprise inside. At Coors Field in Denver, the molasses-flavored popcorn and peanut snacks came with a live mouse.”

Illustrations, graphics, and videos are often your best friends in presenting data stories, as they can do the heavy lifting of the analysis, allowing your reporter’s storytelling (in any format) to flourish in the findings, not drown in the data.

Some examples below illustrate how data can achieve storytelling goals:

A collaborative investigation into the death toll in Puerto Rico caused by Hurricane Maria, which was named 2019 investigation of the year by the Data Journalism Awards. The powerful interactive embedded in the story showed how the numbers grew far beyond initial reports. It also included a searchable database with profiles of the dead.

Quartz and Puerto Rico’s Center for Investigative Journalism.

A look at how thin models must be to walk the catwalk by NOS, Netherlands, which uses a combination of video and graphics to illustrate the findings of an analysis into the sizes of 1000+ models.

Ocean Shock, a Reuters investigation into the effect of the climate crisis on marine life, and a stunning and engaging visual presentation of data.

ESPN’s 2018 analysis of food-safety inspection reports for professional sports venues; this powerful data presentation allowed writers to deliver up the mouse lead above. The graphics provide lots of numbers without overwhelming the reader.

One last thing...

Earlier, we referenced the possibility of doing an investigation based on data you assemble and analyse yourself.

In many ways, that is the most original form of data investigation because you’re not analysing some other entity’s information but rather doing the ground-up reporting that will give you truly unique findings.

That’s the upside. The downside is that this form of investigative data analysis is extremely labour-intensive and fraught with potential methodology questions and errors. It requires more time and greater levels of bulletproofing from reporters and their editors, so plan accordingly if you decide it’s the only way to answer that burning question.

De-identification for data journalists https://datajournalism.com/read/longreads/de-identification-for-data-journalists Wed, 16 Oct 2019 11:39:00 +0200 Vojtech Sedlak https://datajournalism.com/read/longreads/de-identification-for-data-journalists In the pursuit of a story, journalists are often required to protect the identity of their source. Many of the most impactful works of journalism have relied upon such an arrangement, yet the balancing act between publishing information that is vital to a story and protecting the person behind that information can present untold challenges, especially when the personal safety of the source is at risk.

These challenges are particularly heightened in this age of omnipresent data collection. Advances in computing technology have enabled large volumes of data processing, which in turn promotes efforts to monetise data or use it for surveillance. In many cases, the privacy of individuals is seen as an obstacle, rather than an essential requirement. Recent history is peppered with examples of privacy violations, ranging from Cambridge Analytica’s use of personal data for ad targeting to invasive data tracking by smart devices. The very expectation of privacy protection seems to be withering away in the wake of ongoing data leaks and data breaches.

With more data available than ever before, journalists are also increasingly relying on it in their reporting. But, just as with confidential sources, they need to be able to evaluate what information to publish without revealing unnecessary personal details. While some personal information may be required, it’s likely that most stories can be published without needing to identify all individuals in a dataset. In these cases, journalists can use various methods to protect these individuals’ privacy, through processes known as de-identification or anonymisation.

To help journalists champion responsible and privacy-centric data practices, this Long Read will cover how to:

  • identify personal information
  • evaluate the risks associated with publishing personal information
  • utilise different de-identification methods in data journalism.

Defining personal information

While the definition of what constitutes personal information has become more formalised through legal reform in the late 2000s, it has long been the role of journalists to uncover if a release of data, whether intended or accidental, jeopardises the privacy of individuals. After AOL published millions of online search queries in 2006, journalists were able to piece together individual identities solely based on individuals’ search histories, including sensitive information about some individuals’ health statuses and dating preferences. Similarly, in the wake of Edward Snowden’s revelations of NSA spying, various researchers have shown how communication metadata -- information generated by our devices -- can be used to identify users, or serve as an instrument of surveillance.

But, when using a dataset as a source in a story, journalists are put in the new position of having to evaluate the sensitivity of the information at hand themselves. And this assessment starts with understanding what is and isn’t personal information.

Personally identifiable information (PII), legally described as ‘personal data’ in Europe or ‘personal information’ in some other jurisdictions, is generally understood as anything that can directly identify an individual, although it is important to note that PII exists along a spectrum of both identifiability and sensitivity. For instance, names or email addresses have a high value in terms of identifiability, but a relatively low sensitivity, as their publication generally doesn’t endanger an individual. Location data or a personal health record may have lower identifiability, but a higher degree of sensitivity. For illustration purposes, we can plot various types of PII along the sensitivity and identifiability spectrums.

PII exists along a spectrum of sensitivity and identifiability.

The degree to which information is personally identifiable or sensitive depends on both context and the compounding effect of data mixing. A person’s name may carry a low risk in a dataset of Facebook fans, but if the name is on a list of political dissidents, then the risk of publishing that information increases dramatically. The value of information also changes when combined with other data. On its own, a dataset that contains purchase history may be difficult to link to any given individual; however, when combined with location information or credit card numbers, it can reach higher degrees of both identifiability and sensitivity.

In a 2016 case, the Australian Department of Health published de-identified pharmaceutical data for research purposes, only to have academics decrypt one of the de-identified fields. This created the potential for personal information to be exposed, prompting an investigation by the Australian Privacy Commissioner. In another example, Buzzfeed journalists investigating fraud among pro tennis players in 2016 published the anonymised data that they used in their reporting. However, a group of undergraduate students was able to re-identify the affected tennis players by using publicly available data. As these examples illustrate, a journalist’s ability to determine the personal nature of a dataset requires a careful evaluation of both the information it contains and the information that may already be publicly available.

While these tennis players’ names may appear anonymous, BuzzFeed’s open-source methodology also included other data which allowed for the possibility of re-identification.

What is de-identification?

In order to conceal the identity of a source, a journalist may grant anonymity or use a pseudonym, such as Deep Throat in the case of the Watergate scandal. When working with information, the process of removing personal details is called de-identification or, in some jurisdictions, anonymisation. Long before the internet, data de-identification techniques were employed by journalists, for example by redacting names from leaked documents. Today, journalists are armed with new de-identification methods and tools for protecting privacy in digital environments, which make it easier to analyse and manipulate ever larger amounts of data.

The goal of de-identifying data is to avoid possible re-identification, in other words, to anonymise data so that it cannot be used to identify an individual. While some legal definitions of data anonymisation exist, the regulation and enforcement of de-identification is usually handled on an ad-hoc, industry-specific basis. For instance, health records in the United States must comply with the Health Insurance Portability and Accountability Act (HIPAA), which requires the anonymisation of direct identifiers, such as names, addresses, and social security numbers, before data can be published for public consumption. In the European Union, the General Data Protection Regulation (GDPR) enforces anonymisation of both direct identifiers, such as names, addresses, and emails, as well as indirect identifiers, such as job titles and postal codes.

In developing their story, journalists have to decide what information is vital to a story and what can be omitted. Often, the more valuable a piece of information, the more sensitive it is. For example, health researchers need to be able to access diagnostic or other medical data, even though that data can have a high degree of sensitivity if it is linked to a given individual. To strike the right balance between data usefulness and sensitivity, when deciding what to publish, journalists can choose from a range of de-identification techniques.

An example of a redacted CIA document. Source: Wikimedia.

Data redaction

The simplest way to de-identify a dataset is to remove or redact any personal or sensitive data. While an obvious drawback is the possible loss of the data’s informative value, redaction is most commonly used to deal with direct identifiers, such as names, addresses, or social security numbers, which usually don’t represent the crux of a story.

That said, technological advances and the growing availability of data will continue to increase the identifiability potential of indirect identifiers, so journalists shouldn’t rely on data redaction as their only means of de-identification.

Pseudonymisation

In some cases, removing information outright limits the usefulness of the data. Pseudonymisation offers a possible solution, by replacing identifiable data with pseudonyms that are generated either randomly, or by an algorithm. The most common techniques for pseudonymisation are hashing and encryption. Hashing relies on mathematical functions to convert data into unreadable hashes. Encryption, on the other hand, relies on a two-way algorithmic transformation of the data. The primary difference between the two methods is that encrypted data can be decrypted with the right key, whereas hashed information is non-reversible. Many database systems, such as MySQL and PostgreSQL, support both the hashing and encryption of data.
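A minimal sketch of salted hashing in Python follows; the email address and salt are made up. One caveat: hashing low-entropy values (names, plate or medallion numbers) without a secret salt can be brute-forced, which is exactly what went wrong in the taxi example further below.

    import hashlib

    def pseudonymise(value, salt):
        """One-way transformation: the same input always yields the same
        pseudonym, but the original value cannot be recovered from the hash."""
        return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]

    # Hypothetical identifier; keep the salt secret, otherwise low-entropy
    # inputs can be recovered by simply hashing every plausible candidate
    print(pseudonymise("jane.doe@example.com", "a-long-random-secret"))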

Data pseudonymisation played an important role in the Offshore Leaks investigation by the International Consortium of Investigative Journalists (ICIJ). Given the vast volume of data that needed to be processed, journalists relied on unique codes associated with each individual and entity that appeared in the leaked documents. These pseudonymised codes were used to show links between leaked documents, even in cases when the names of individuals and entities didn’t match.

Information is considered pseudonymised if it can no longer be linked to an individual without the use of additional data. At the same time, the ability to combine pseudonymised data with other datasets renders pseudonymisation a possibly weak method of de-identification. Using the same pseudonym repeatedly throughout a dataset also decreases its effectiveness, as the potential for finding relationships between variables grows with every occurrence of the pseudonym. Finally, in some cases, the very algorithms used to create pseudonyms can be cracked by third parties, or have inherent security vulnerabilities. Therefore, journalists should be careful when using pseudonymisation to hide personal data.

In 2013, Jonathan Armoza identified taxi trips made by celebrities Bradley Cooper and Jessica Alba in a dataset of New York taxi rides in which each taxi's license and medallion numbers were supposedly hashed. To crack the code, he simply searched for images of the celebrities getting out of taxis and combined them with other information available in the dataset.
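The underlying failure is easy to demonstrate. The sketch below does not reproduce the taxi dataset's exact scheme, but shows the general problem: when identifiers come from a small, known format, every possible value can be hashed and matched against the 'anonymised' column in seconds.

    import hashlib

    def md5_hex(value: str) -> str:
        return hashlib.md5(value.encode()).hexdigest()

    # Assume, for illustration, that licence numbers are six digits
    published_hash = md5_hex("482915")  # a hash found in the released data

    # Brute-force the whole keyspace: only a million candidates
    for n in range(1_000_000):
        candidate = f"{n:06d}"
        if md5_hex(candidate) == published_hash:
            print("Recovered licence number:", candidate)
            break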

Statistical noise

Since both data redaction and pseudonymisation carry the risk of re-identification, they are often combined with statistical noise methods, such as k-anonymisation. These ensure that at least a set number of individuals share the same indirect identifiers, thereby frustrating re-identification. As a best practice, no combination of identifiers should be shared by fewer than 10 entries. Common techniques for introducing statistical noise into a dataset are generalisation, such as replacing the name of a country with a continent, and bucketing, the conversion of exact numbers into ranges. Data redaction and pseudonymisation are often used alongside these techniques to ensure that no unique combinations of identifiers remain in a dataset. In the following example, data in certain columns is generalised or redacted to prevent re-identification of individual entries.

Adding statistical noise to prevent re-identification.
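A sketch of both techniques in pandas, with a final check against the rule of thumb above; the file, columns, and mapping are hypothetical.

    import pandas as pd

    df = pd.read_csv("respondents.csv")  # hypothetical file and columns

    # Generalisation: replace a specific value with a broader category
    to_continent = {"Kenya": "Africa", "Nigeria": "Africa", "France": "Europe"}
    df["region"] = df["country"].map(to_continent)

    # Bucketing: convert exact numbers into ranges
    df["age_range"] = pd.cut(df["age"], bins=[0, 18, 30, 45, 65, 120],
                             labels=["<18", "18-29", "30-44", "45-64", "65+"])

    # Every combination of indirect identifiers should cover at least 10 entries
    group_sizes = df.groupby(["region", "age_range"], observed=True).size()
    assert (group_sizes >= 10).all(), "Some groups are too small to publish"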

Data aggregation

When the integrity of raw data doesn't need to be preserved, journalists can rely on data aggregation as a method of de-identification. Instead of publishing the complete dataset, data can be published in the form of summaries that omit any direct or indirect identifiers. The principal concern with data aggregation is ensuring that the smallest segments of the aggregated data are large enough so as not to reveal specific individuals. This is particularly relevant when multiple dimensions of aggregated data can be combined, as in the case study below.
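A minimal sketch of this approach: publish per-group counts rather than raw rows, and suppress any cell that falls below a minimum size. The threshold and column names are assumptions.

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical raw data

    # Publish counts per country instead of individual responses
    summary = df.groupby("country").size().reset_index(name="respondents")

    # Suppress cells below a minimum size rather than publishing tiny groups
    MIN_CELL = 10
    summary["respondents"] = summary["respondents"].astype("Int64")
    summary.loc[summary["respondents"] < MIN_CELL, "respondents"] = pd.NA

    summary.to_csv("survey_summary.csv", index=False)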

Case study: Mozilla’s Facebook Survey

Following the Cambridge Analytica scandal, the Mozilla Foundation conducted a survey of internet users about their attitudes toward Facebook. In addition to their attitudes, respondents were asked for information about their age, country of residence, and digital proficiency. The survey tool also recorded IP addresses of users, as well as other metadata, such as the device used and time of submission.

These responses were made available via an interactive tool, which allowed audiences to closely examine the data, including through the ability to cross-tabulate results by demographic criteria, like age or country. But Mozilla also wanted to release all of the survey data to the public for further analysis, so a careful approach to de-identification was required.

To begin the de-identification process, Mozilla removed all communication metadata that wasn't required to complete the analysis. For instance, respondents' IP addresses, as well as submission times, were scrubbed from the dataset. The survey didn't record direct identifiers, such as names or email addresses, so no redaction or pseudonymisation was required. But even though the survey included over 46,000 responses, the data contained certain combinations of indirect identifiers, such as country and age, that allowed users to zoom in on small samples of respondents. Since this increased the risk of re-identification, all countries with fewer than 700 respondents were bundled into an 'other' category, which added sufficient statistical noise to the data.
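Mozilla's bundling step can be sketched in a few lines of pandas. The 700-respondent threshold comes from the case study itself; the file and column names are assumptions.

    import pandas as pd

    df = pd.read_csv("facebook_survey.csv")  # hypothetical export

    counts = df["country"].value_counts()
    rare = counts[counts < 700].index  # countries with fewer than 700 respondents

    # Collapse small countries into a single 'other' category
    df["country"] = df["country"].where(~df["country"].isin(rare), "other")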

Despite these efforts, Mozilla’s privacy and legal teams remained cautious about publishing the data, since its global character implied possible legal liability in various jurisdictions. But, in the end, the value of publishing the data outweighed any remaining privacy concerns.

De-identification workflows for journalists

For journalists on a deadline, de-identification may appear to play second fiddle to more substantive decisions, such as assessing data quality or deciding how to visualise a dataset. But ensuring the privacy of individuals should nevertheless have a firm place in the journalistic process, especially since improper handling of personal data can undermine the very credibility of the piece. Legal liability under privacy laws may also be of concern if the publication is responsible for data collection or processing. Therefore, data journalists should take the following steps to incorporate de-identification into their workflows:

1. Does my dataset include personal information?

It may be the case that the dataset you are working with contains weather data or publicly available sports statistics, which absolves you from the need to worry about de-identification. In other cases, the presence of names or social security numbers will quickly make any privacy risks apparent. Often, however, determining whether data is personally identifiable requires closer examination. This is particularly the case when working with leaked data, as explained in this Long Read by Susan McGregor and Alice Brennan. Aside from noting the presence of any direct identifiers, journalists should pay close attention to indirect identifiers, such as IP addresses, job information, and geographical records. As a rule of thumb, any information relating to a person should be considered a privacy risk and processed accordingly.

2. How sensitive and identifiable is the data?

As explained above, personal information carries different risks depending on the context in which it exists, including whether it can be combined with other data. This means that journalists need to evaluate two things: 1) how identifiable a piece of data is, and 2) how sensitive it is to the privacy of an individual. Ask yourself: Will an individual's association with the story endanger their safety or reputation? Can the data at hand be combined with other available datasets that may expose an individual's identity? If so, do the benefits of publishing this data outweigh the associated privacy risks? A case-by-case approach is required to balance the public interest in publishing against the privacy risks of revealing personal details.

3. How will the data be published?

A journalist writing for a print publication in the pre-internet era didn't have to worry about how data would be disclosed, as printed charts and statistics do not allow for further querying of their underlying data. Today, however, sophisticated tools and interactive visualisations let audiences examine in detail the data used in a particular story. Many journalists also opt for an open source approach, with code and data shared on GitHub. To open source with privacy in mind, all data needs to be carefully scrubbed of personal information. When it comes to visualisations, some journalists protect privacy by publishing pre-aggregated data, which obfuscates the original dataset. But it's important to check that the smallest aggregated groups remain large enough to prevent re-identification.

Fusion uses a visualisation of interrelated entities to illustrate networks investigated as part of the Panama Papers, while still maintaining privacy boundaries on the data.

4. Which de-identification technique is right for your data?

Journalists will often need to deploy a combination of de-identification techniques to suit the data at hand. For direct identifiers, data redaction and pseudonymisation -- if properly implemented -- usually suffice to protect the privacy of individuals. For indirect identifiers, consider adding statistical noise by grouping data into buckets or generalising information that is not vital to the story. Data aggregation is the best option for highly sensitive data, although journalists still have to ensure a broad enough range of data and a sufficiently uniform distribution in the aggregated variables, so that no personal information is inadvertently exposed.

Leading by example

Once data is published online, there is no taking it back for revisions or corrections. Even if you believe your dataset has been scrubbed of all personal details, there remains a risk that someone, somewhere will combine your data with another source to re-identify individuals, or crack your pseudonymisation algorithm and expose the personal information it contains. And the risks of re-identification will only increase with the development of new technologies, such as machine learning and pattern recognition, which enable unanticipated ways of combining and transforming data.

Remember that seemingly impersonal data points may be used for identification when combined with the right data. When Netflix announced its notorious Netflix Prize for the best recommendation algorithm, the released data was scrubbed of all personal identifiers. Yet researchers were able to cross-reference movie preferences with data from IMDb.com and other online sources to identify individuals in the 'anonymised' Netflix dataset.

Despite the limitations of today's de-identification methods, journalists should always be diligent in their efforts to protect the privacy of individuals, whether by concealing the identity of a source or by de-identifying the personal information behind their stories.

Leading by example, the ICIJ handles vast volumes of personal data with privacy front of mind. When reporting on the Panama Papers, journalists both protected the anonymity of the source of the leak, by using the pseudonym John Doe, and carefully evaluated how to publish the private information within the leaked documents. There is no reason why journalists of all backgrounds can’t take similar steps to strike a balance between privacy and the public interest in their reporting.

And there are many examples of the fallout that can follow disclosure of personal data when privacy-conscious steps aren't taken, from the personal tragedies of the Ashley Madison leak to the vast exposure of sensitive data associated with WikiLeaks. Data journalists should strive to avoid these pitfalls and instead promote responsible data practices in their reporting at all times.

For more on privacy and data journalism:

Data journalism in disaster zones https://datajournalism.com/read/longreads/data-journalism-in-disaster-zones Wed, 18 Sep 2019 12:08:00 +0200 Steve Doig Joshua Mutisya Arun Karki, John Maines, Cristen Tilley, Norman Zafra https://datajournalism.com/read/longreads/data-journalism-in-disaster-zones 305, 8200, 922. At first sight, these numbers appear meaningless. But read them as kilometres per hour, metres of depth, and millibars of barometric pressure, and they mark the start of three major natural disasters: the speed of the winds when Typhoon Haiyan hit the Philippine province of Eastern Samar; the depth of Nepal's April 2015 earthquake; and the intensity of Hurricane Andrew when it made landfall in Florida 23 years prior.

Looking at these numbers another way, they also represent the most powerful storm to ever strike land, the worst natural disaster in Nepal’s recent history, and the United States’ sixth most intense hurricane.

Superlatives aside, the impact of these disasters raises significant challenges for all reporters -- data or not. In a rapidly changing reporting environment, how do you uphold journalistic standards of truth and accuracy? How do you find and verify data? What are the risks and best practices of data reporting, especially when your audience includes victims? And, perhaps most crucially, how do you physically get through the weather to report?

To find out, we spoke to a global group of seasoned data journalists, all tasked with reporting in the midst of disaster:

  • Steve Doig: Professor at Arizona State University’s Cronkite School of Journalism, sharing a Pulitzer prize for reportage on Hurricane Andrew at the Miami Herald in 1993.
  • Arun Karki: Executive Director and Founder of the Center for Data Journalism Nepal, who reported on the April 2015 Nepal Earthquake.
  • John Maines: Database Editor for the Sun Sentinel in Florida, where he's reported on the area's various hurricanes and the 1996 ValuJet airplane crash.
  • Joshua Mutisya: Data Journalist at Kenya’s Nation Newsplex, which has experience reporting on the nation’s droughts and food insecurity emergencies.
  • Cristen Tilley: Senior Journalist at the ABC News Interactive Digital Storytelling Team in Queensland, Australia -- an area frequented by floods and cyclones.
  • Norman Zafra: Journalist and documentary maker, who brings experience from multimedia reporting on the Typhoon Haiyan disaster in the Philippines.

How is reporting on a disaster different from other data stories?

Steve Doig, Hurricane Andrew: A key difference for a local reporter is that you and your family may be victims of the disaster yourselves. This means the added pressure of trying to take care of your own situation while at the same time doing your job of reporting on the disaster. In my case, Hurricane Andrew destroyed my house, so my wife, our two children, and I had to live in a trailer in our driveway during the months it took to make the home livable again.

Cristen Tilley, cyclones and floods: If you're in the disaster, access to the basics like internet, electricity, food, and getting staff to work is hard enough. You need to prioritise those. When we were flooded out of our newsroom in the 2011 Brisbane floods, we had people doing shifts from their lounge rooms for a day or two while a makeshift newsroom was set up. Once we had the makeshift newsroom going, we had a core team to take care of the main news coverage, we deployed reporters to hotspots and then a smaller team worked on special coverage/data. If you're trying to work with data but not actually affected by the disaster, then it's all about finding reliable data in time.

Flooding of a Brisbane motorway during the floods illustrates the practical challenges facing reporters during a disaster. Credit: Martin Howard, CC BY 2.0.

Joshua Mutisya, Kenyan emergencies: Unlike other data stories, disaster stories are largely marred by a lack of reliable data. The authorities are also reluctant to provide information. Once a disaster occurs, little focus is given to the data angle of the story, as newsrooms are more interested in getting the right multimedia product to their audience. Data journalism is different, though: it takes time to do a thorough analysis, so a worthy piece can be successfully completed days after the disaster has occurred.

Can you tell us more about the challenges around data reliability?

Norman Zafra, Typhoon Haiyan: The spread of unreliable information is a persistent feature of disaster events, especially when everyone is in a state of shock. Although natural disasters such as typhoons are often highly predictable, official data can remain absent, given the enormity of the tasks shouldered by local officials. To overcome this, news organisations in the Philippines employ dedicated research teams that help journalists coordinate, collect, and verify information from different sources. Access to this data is via an intranet.

Steve Doig, Hurricane Andrew: Anecdotal information after a disaster certainly can be unreliable -- witness the doctored ‘shark in the subway’ images from recent hurricanes. But much of the data that we would want to use for analysis -- damage surveys, property tax rolls, utility outages, and so on -- comes from official sources. The problem is that it can take weeks for authorities to gather a detailed house-by-house assessment of damage across a widespread disaster area. After Hurricane Andrew, the first damage assessment lists we got were from the Red Cross, but they were useless for analysis -- impressionistic and very general (stuff like ‘lots of roof damage along 125th Street’). But county inspectors began building a true house-level database with addresses and percentages of damage to roofs, windows, and other structural elements. That database, which came to us in pieces over several weeks, was crucial for our analysis.

This fake shark went viral during Hurricane Irene in 2011, Hurricane Sandy in 2012, Texas' flash floods in 2015, and Hurricane Matthew in 2016.

Joshua Mutisya, Kenyan emergencies: The greatest challenge is the lack of reliable data. This can largely be attributed to over-reliance on singular sources of information, such as government agencies. To counter this, data journalists should have alternative ways of acquiring data from the ground. This means media houses must also invest in localised data collectors who can provide credible information in time.

What are some other ways that journalists can address unreliable and deficient data?

Cristen Tilley, cyclones and floods: For our coverage of Cyclone Debbie, we found a source of real-time wind speeds and visualised the current wind speeds in the cyclone zone. This was data from the weather bureau, so we knew it could be trusted. This is different from trying to provide information on a fast-moving, unpredictable bushfire. So the way you account for the risk involved in doing data work is to think about how it could affect the audience/reader, and don't go near coverage that could have adverse effects. One of the other problems we encountered with the cyclone coverage was that we didn't have time to scrape the data, so we ended up updating it by hand. This was time consuming, but we didn't really have a choice.

The ABC’s visualisation of Cyclone Debbie’s wind speeds, using reliable information from the Australian Bureau of Meteorology.

Joshua Mutisya, Kenyan emergencies: Back in 2016 and parts of 2017, Newsplex carried out analysis of the drought that had struck the country. To get a clearer picture, we used data from the National Drought Management Authority, which showed that warnings of an impending drought had been given long before. This provided a larger angle to the story, instead of just focusing on government figures for the estimated number of people affected.

Arun Karki, Nepal Earthquake: You can verify or fact-check the data to some extent using online verification tools. Many hoaxes buzzed around on social media during the April 2015 Nepal earthquake, and we were receiving so much conflicting information about the deaths and damage from local and individual sources. So we only quoted official sources -- they were slower, but credible.

In addition to data reliability, do disaster scenarios raise any other ethical concerns for data journalists? And how can these be addressed?

Arun Karki, Nepal Earthquake: Personal identifiers in data tables are very sensitive information -- and may pose risks to individuals or specific groups and communities if exposed or published. Two years after the 2015 Nepal Earthquake, the Central Bureau of Statistics (CBS) of Nepal published a comprehensive dataset of housing damage but, to protect privacy, withheld certain private information about beneficiaries, such as the geo-location tag of each household. Such sensitive information can be especially risky for women and other vulnerable communities.

‘Not fact-checked’ data could also be traumatising to audiences. However, after some time (a few hours or days -- it depends), online or social media crowdsourcing could be a good way to start verifying data, when authentic or real-time datasets are not available from trustworthy sources.

John Maines, Florida disasters: Just be as precise as possible, particularly in live television and radio newscasts. If you report, or interview someone who says, "the whole town is leveled" or "the entire city is underwater", is that really true? Better to report that the water has reached a depth of XX feet in YY neighborhood and is expected to continue rising until it reaches ZZ street. Audiences will still be traumatised -- there is no way around that -- but details are better than sweeping, non-specific statements. If you don't know the details, just say you're working hard to find them.

Despite these challenges, how can data be used to improve the way that journalists report on disasters?

John Maines, Florida disasters: You can use data to map out places that were not impacted by a disaster. We did this years ago. People in our poorer neighborhoods learned that they could get money from our Federal Emergency Management Agency (FEMA) by falsely reporting damage they did not have. People would remove stuff from their homes, spray it with a garden hose, and call FEMA. Hundreds of millions of dollars in fraud, paid for by taxpayers. We were a Pulitzer finalist for those stories, which we reported from several cities around the United States.

Sometimes disaster-affected areas aren't the story.

Steve Doig, Hurricane Andrew: Data can be used to track recovery, too. We used data from the municipal water and sewer utility to get the dates when repaired homes were being reoccupied, and used that to build timeline maps showing where recovery was -- and wasn't -- occurring.

Joshua Mutisya, Kenyan emergencies: One, in the wake of a disaster, journalists get quite emotive and sometimes end up providing information that is exaggerated. Numbers don't lie, so with dependable sources, we as data journalists don't fail the audience by providing non-factual information. Two, using data, we are able to provide different angles to a story. In the wake of the famine that hit Kenya in 2017, Newsplex covered different aspects: from the overall estimates of people affected, to the worst and least hit counties, to the invasion of destructive armyworms in Kenyan farms, which threatened to worsen the situation.

Nation NewsPlex.

Norman Zafra, Typhoon Haiyan: Data visualisation is a powerful form of contextual reporting and is able to bind together pieces of information collected from various sources. To improve the way journalists report on disasters, a centralised database provided by key government agencies and official sources is always useful. After Haiyan, for example, a database of aid funding was made available to the public. It was a rich source of information that allowed opportunities for journalists to interpret and innovate the presentation of complex figures.

Steve Doig’s groundbreaking maps in the Miami Herald.

On data visualisation, what are the best ways to use it in disaster reporting?

Steve Doig, Hurricane Andrew: Maps. Our Andrew coverage was among the first to use GIS mapping to overlay the path of the hurricane winds onto a grid shaded by the percentage of homes that were damaged. But today’s GIS tools and property shapefiles make for fantastically better resolution of such maps, down to the house level, than was possible 25 years ago.

The report’s cover shot, illustrating varying levels of damage between the houses at the top, which were the most devastated, and those towards the bottom, which survived.

I’ll also recommend good aerial imagery. We acquired hundreds of detailed color pictures taken from an altitude of about 600 feet as part of a systematic damage inventory. The cover shot of our major damage analysis report was a single image showing three neighboring subdivisions with wildly differing levels of damage, attributable to varying construction standards. Another good use for a comprehensive collection of pre-disaster imagery is to create before-and-after slide shows.

Norman Zafra, Typhoon Haiyan: Make the data presentation sexy -- perhaps look for patterns, visually highlight the most relevant message, and convey meaning that informs rather than complicates. Graphic designers should be part of the team too, since design is usually outside the skill set of most journalists. My suggestion is to tap the interactivity of web and mobile platforms to allow the audience to interact with data, rather than treating the web as a repository of static infographics. An interactive map is especially useful in the Philippines, where the extent of devastation can only be grasped when presented on a map. Personally, I find disaster data reporting extremely challenging -- too much complexity in the presentation can turn off the audience, while oversimplification can be misleading.

A timeline interactive, produced by Norman to illustrate the extent of Typhoon Haiyan’s damage in the Philippines.

Since it is difficult to predict when a disaster might strike, what advice do you have for data journalists to prepare themselves?

John Maines, Florida disasters: You need a storm/disaster kit. I don't care if you are a data person. Don't try to put it together when disaster is about to strike -- that is stupid. I know that is what we are told even as regular people, but it is true. You need flashlights, a backup charger for your cell, water. Canned food. It is amazing to me how many people go to the supermarket in the hours before a storm hits, just for water. Go online, order a few of those five gallon foldable plastic things campers use. Then stow them. Before the storm hits, fill them up. No trip to the market! Also, think about things you don't normally think about, like mosquito repellent. When ValuJet crashed in 1996, I spent the night in the Florida Everglades getting eaten alive by mosquitos. Thank god, in the middle of the night, the American Red Cross came by with water and repellent. God bless them.

Also, news organisations tend to let batteries die on laptops, because they are plugged in most of the time. Check yours.

Steve Doig, Hurricane Andrew: As for data, gather what you can before disaster hits. Examples would include the property tax roll, which should have detail about ownership and value and age and type of construction for every structure in your area. Also, maybe a detailed historical database of the kinds of disasters which might hit your area, whether they’re hurricanes or earthquakes or wildfires or whatever. A couple of years before Andrew, I did a full-page graphic showing the history of hurricanes in Florida; immediately after Andrew, we dusted it off, added Andrew’s track and republished it.

Norman Zafra, Typhoon Haiyan: In the case of typhoon disasters, journalists must learn from previous data reportage to assess what worked and what didn’t work for the public’s understanding of the news. Evaluation is necessary. To prepare for it, there should be a way to routinise disaster data reporting (e.g. devising a design or content template for emergency data reporting, or a repository of previous data reports). For instance, we can learn from the principle of a ‘dark website’ in crisis communication. A dark website is a hidden (template) page that is activated only when there is a crisis. It’s prepared in case a crisis occurs without warning.

Cristen Tilley, cyclones and floods: We try to think about coverage before the storm season hits. Look at advances or developments in our technology, as well as in the data available, and see if there's anything that can be done differently. It's helpful to have a few go-to sources of reliable data if an unexpected disaster happens too, such as the weather bureau or other emergency authorities. Also, look back at your previous coverage and see if you can tweak anything so you're more prepared. For example, after the experience of updating the wind speed tracker by hand, I wrote a script to get that data automatically next time.
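Tilley's actual script isn't published here, but the kind of automation she describes can be sketched in a few lines of Python. The endpoint URL and response fields below are purely hypothetical; a real weather bureau feed will differ.

    import csv
    import time
    import requests

    FEED_URL = "https://example.org/api/wind-observations.json"  # hypothetical

    def fetch_and_append(path="wind_speeds.csv"):
        data = requests.get(FEED_URL, timeout=30).json()
        with open(path, "a", newline="") as f:
            writer = csv.writer(f)
            for obs in data["observations"]:  # assumed response shape
                writer.writerow([obs["station"], obs["time"], obs["wind_kmh"]])

    while True:  # poll every ten minutes during an event
        fetch_and_append()
        time.sleep(600)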

Any final thoughts?

John Maines, Florida disasters: Pace yourself. As reporters, our first instinct when disaster hits is to go to the newsroom. We learned years ago that this was a mistake. In fact, our editor in 1992, when Hurricane Andrew hit, said the worst mistake he ever made was insisting that everyone come to the newsroom before the storm and sit it out there. So we had a bunch of people in the newsroom with no power, worried about their loved ones and whether they had survived. Not good. And they all got tired at the same time.

What we do now is have a red team and a blue team. The red team sits out the storm in the newsroom, arriving a day before it hits and bringing sleeping bags. The blue team, which I am part of because data work is delayed, comes in after the storm is gone.

Steve Doig, Hurricane Andrew: In a disaster, your first duty is to your family. After Andrew, it was a couple of days before I could even think about going into the newsroom. I had to make my heavily damaged house into something of a shelter for my wife and young children, including getting the toilets to operate with buckets of water and making sure exposed wiring would be safely out of the way when the electricity came back a couple of weeks later.

Arun Karki, Nepal Earthquake: Don’t hurry. Don’t rush to break headlines unless the data is verified. It’s also good practice to keep all of your datasets -- whatever comes in after the incident, from every last source -- to use later in your follow-up reporting, because every dataset could be useful for future stories. For example, detailed datasets of house damage and housing grant beneficiaries were initially made public by local authorities in Nepal after the 2015 earthquake. However, those granular, disaggregated datasets (very helpful for comparison and correlation during reconstruction) could not be retrieved or accessed later. So, keep mining for data. Another tip: reach out to different government levels and sources. If data is not available at the national level, it could be at the sub-national, local, or periphery level. And if you’re not familiar with advanced data skills, find and collaborate with tech-savvy individuals or groups who can help you, from data gathering to visualisation.

For more emergency reporting, check out our:

Hard data and soft statistics: a guide to critical reporting https://datajournalism.com/read/longreads/hard-data-and-soft-statistics-a-guide-to-critical-reporting Sat, 31 Aug 2019 06:56:00 +0200 Morten Jerven Kate Wilkinson https://datajournalism.com/read/longreads/hard-data-and-soft-statistics-a-guide-to-critical-reporting It is generally believed by the layman, the expert, and the journalist alike that numbers are hard and judgements are soft. When we see a number or a statistic, we think of it as objective, accurate, and incontestable; but when we hear that someone considers, believes, or has an opinion, our sceptical minds awaken. Often, however, numbers are far softer than we commonly assume. Basic metrics such as inflation, or debt as a share of GDP, have been shown to change radically after revisions and, at times, have been revealed to be fraudulent. It turns out that just as we would treat one person's view as an anecdotal observation that needs to be questioned, numbers and statistics should also be subject to serious cross-examination.

Morten Jerven is a Professor at the Norwegian University of Life Sciences and a Visiting Professor in Economic History at Lund University. He's published widely on economic development statistics and authored three books, including Poor Numbers: How We Are Misled by African Development Statistics and What to Do about It.

Why and how are statistics used?

There are many reasons why numbers and statistics are so popular, but perhaps the most important is that they allow us to say something, and make decisions, about things we know nothing about. An important example for finance journalism comes from bond credit ratings -- where countries are rated 'Triple A' or not -- which in turn allow investors to decide whether they should put their millions in, say, a Cypriot bond.

Without the Triple A rating, the investor would have no signal, no information, and no basis whatsoever on which to make that investment decision. The rating allows investors to sort countries they may know nothing about, and to invest based on this signal of the economy's creditworthiness. Should Cyprus boast a good rating, for example, investors may buy bonds without knowing whether it is a sovereign state or a dependency of Italy, Greece, Turkey, or Great Britain, let alone its capital, its main export, or the name of its currency. The rating further allows investors to change their minds about the investment when the rating changes -- even if this change only reflects the mood of investors, rather than the economy's stability.

Financial ratings of European states by Standard & Poor's rating on 18 February 2019. Credit: Wikimedia (CC BY-SA 4.0).

The fact that numbers are not always objective reflections, based on accurate readings and observations, is easy to forget. After all, some of the key metaphors we use to report financial numbers are taken from meteorology, where we do indeed have ways of measuring rainfall, temperature, and wind. These phenomena exist regardless of whether we measure them. But that is not true of everything we measure and, for a lot of the numbers that are regularly reported, it is important to remember that the phenomenon may not exist independently of the process that measures it.

Measuring the immeasurable

There is no way of objectively inserting a measurement stick to determine a Triple A rating, much less of reading other subjective phenomena. Take corruption, for example. Corruption is undeniably important and at the forefront of what journalists should report on, but there is no way its level can be objectively gauged. By its nature, corruption takes place in secret and concealed forms, and the extent to which it is acceptable arguably depends a great deal on the specific situation, the cultural setting, and variations in law.

futureatlas.com on Flickr (CC BY 2.0).

Yet we want to measure it, and there are many rankings that purport to give a definitive picture of corruption across countries. To understand the limitations of these indices, consider a hypothetical: if you stopped a random European in the street and asked, "How corrupt is Nigeria?", and the person responded, "Oh, very corrupt, I don't trust African people at all, they don't follow rules like we do in Europe", you should -- and I think most journalists would -- dismiss this as unreportable prejudice. However, if an organisation like Transparency International (TI) calls up 100 people in a survey and asks them to rate Nigeria and Sweden on a corruption scale from 1 to 10, this number, and the resultant Corruption Perception Index, becomes headline news.

These indices are created to be influential, and they gain most of that influence by being re-reported by journalists as an accurate reflection of a country's socio-political circumstances. The truth is that TI's Corruption Perception Index is just an averaged subjective perception index, which reflects the prejudices people hold about governance in poor countries. Corruption is not observable, and nor are many other socio-political phenomena, so reporting on them without reporting on how the data was generated should not pass a journalist's basic fact-checking tests.

Fact-checking a statistic

Fact-checking a statistic requires more than checking a claim against other statistics. In 2014, The Economist wrote a guide on lying with indices, which laid out the dirty tricks used by the compilers of rankings, indices, and other numbers that summarise the world. Journalists can seek help from researchers and institutions that use these numbers in their work, as they normally know their weaknesses, and there is a growing area of critical scholarship that questions the effects of quantification. Some simple first questions journalists might ask themselves:

  • Is this phenomenon objectively observable? There is, for example, a difference between recording rainfall and recording happiness.
  • Under what conditions was the issue observed? Numbers on rape victims might be very difficult to gather, compared to counts of cars crossing a toll bridge.
  • Who made the observation? Consider whether they would have any reason to present a biased measure.

Just as we would subject a witness statement to cross examination, we should also tear apart a statistic when it is presented to us.

Case study: Fact-checking subjective indices

By Kate Wilkinson, Africa Check

In 2014, Africa Check investigated reports that South Africa’s maths and science education was the ‘worst in the world’.

These regular headlines were based on the findings of the World Economic Forum's (WEF) Global Information Technology Reports, which rank countries on the quality of their maths and science education. In the most recent report, from 2016, South Africa was again ranked last out of 139 countries.

But the WEF does not conduct standardised tests to assess the quality of maths and science education in the countries surveyed. Rather, the rankings are the result of an ‘executive opinion survey’, where unidentified ‘business leaders’ are asked to rate the quality of maths and science education in their country on a scale from 1 (worst) to 7 (best).

The resulting education rankings are not, in fact, an assessment of the quality of education in South Africa, or in any of the other countries. Instead, this subjective index reflects the personal opinions of a small group of unidentified people about a topic in which they are not experts.

In fact-checking this statistic, we spoke to leading education experts in South Africa. They were able to critique the ranking and point us to the most recent and reliable education rankings.

For instance, Martin Gustafsson, an economics researcher at the University of Stellenbosch, told us: “There is valuable data in the [WEF] report. For things like business confidence it is useful. But you can’t apply opinions to things like education. It is like asking business experts what they think the HIV rate is”.

For journalists reporting on education levels, looking at other sources provides a more comprehensive view of a country's performance. For example, standardised cross-national testing reveals that South Africa does have problems with its maths education, but that it performs better than a number of countries. In 2007, the Southern and Eastern Africa Consortium for Monitoring Educational Quality ranked South Africa eighth out of fifteen countries for its maths performance. Mozambique, Uganda, Lesotho, Namibia, Malawi, and Zambia all performed worse.

Remember: Make sure to check whether a ranking measures the actual phenomenon or simply what people think about it. Always look for additional sources and expert help to corroborate or contextualise a subjective index.

The impact of statistical reporting

Uncritical reporting on subjective indices does, of course, have real consequences. The more frequently soft statistics are re-reported, the more likely we are to bring them into our arsenal of hard statistics, forgetting that 'corruption' is not a quantity that can be easily gauged. There are consequences to pretending you can measure something that you cannot -- it may seriously mislead us into thinking we know something we do not -- and, in the end, this can translate into very bad decisions.

One of these bad decisions was made in 2015, when the UK's Financial Conduct Authority penalised the Bank of Beirut for failing to establish sufficient controls against money laundering and other financial crimes. As part of this decision, the Authority banned the Bank from taking on customers in 'high-risk' corruption jurisdictions. But there is no objective list or measurement of which countries are corrupt, so they reached for the next best thing: TI's Corruption Perception Index.

As commentators pointed out at the time, there is a close-to-perfect correlation between a country's GDP per capita and its ranking on the Corruption Perception Index. So, in effect, low income countries were denied the banking services available to high income countries, simply because the people surveyed by TI suspected that lower income countries are more corrupt.

The Center for Global Development mapped the countries which the FCA considered to be ‘high risk’, that is, any ranking 60 or below on the Index.
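A claim like this is straightforward to test yourself, given a table with one row per country. The sketch below assumes a hypothetical CSV and column names, and uses a rank correlation, which suits ordinal index scores.

    import pandas as pd

    # Hypothetical merged table: GDP per capita and CPI score per country
    df = pd.read_csv("gdp_vs_cpi.csv")

    # Spearman rank correlation is appropriate for ranking-style indices
    rho = df["gdp_per_capita"].corr(df["cpi_score"], method="spearman")
    print(f"Spearman correlation: {rho:.2f}")  # values near 1 support the claim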

Corruption reportage is something we need to focus our attention on, along with other similar 'unknowns' on the list of global problems, like trafficking, drugs, and illicit finance. Increasingly, organisations involved in promoting political traction on these issues find themselves involved in a numbers game. There is always an incentive to highlight the bigness of a problem to help further the cause. And there is no reason to believe that TI, or other organisations such as Walk Free, which publishes a Global Slavery Index, are unaware of the flawed nature of the data. Rather, they take a calculated risk, hoping that the upside, in terms of more influence, is bigger than the downsides of mismeasuring.

To combat superficial reporting, and avoid exacerbating its impacts, journalists need to dig deeper and question clickbait indicators. Strategies include:

  • Contacting researchers and experts in the field -- they’ll be able to comment on the usefulness of subjective indicators, as well as any limitations
  • Looking beyond numbers, by producing investigative stories that look for information which isn’t considered or communicated by an indicator
  • Asking yourself how the story would hold up if you used a different measure or a different data source to frame the narrative.

Case study: Reporting on corruption

By Kate Wilkinson, Africa Check

How much money has South Africa lost due to corruption since democracy started in 1994? A popular and widely shared estimate is R700 billion. This figure has been published by newspapers and tweeted by a prominent trade union leader -- but, as Africa Check’s 2015 fact-check revealed, it’s a thumbsuck.

Underpinning these reports, a civil society handbook claimed that ‘damages from corruption’ is usually estimated at ‘between 10% and 25%, and in some cases as high as 40 to 50%’ of the country’s public procurement contracts. But no source was provided to substantiate these estimates.

As years passed, the claim changed. It was then reported that around 20% of the country’s gross domestic product -- not procurement contracts -- was lost every year to corruption.

So, how much has corruption cost South Africa? The frustrating -- and logical -- answer is that we just can't say for sure.

The country’s treasury has not attempted to calculate an estimate. And while governance experts agree that a large amount of money has been lost, they won't be drawn on an exact number.

With little information available, journalists should be wary of definitive reports on national corruption levels. Instead, it may be possible to piece together a picture of corruption in a country by bringing together different sources.

National surveys are one resource that can be used to shed light on people's experience of bribery and corruption. For example, in 2016 nearly a third (32.3%) of adults in Nigeria reported paying bribes to a public official or said that they were asked to in the year before. But, as always, all surveys should be interrogated and corroborated with country-level experts.

Trust and mistrust in official numbers

So far, we’ve looked at how numbers tend to be reported as hard facts; now we’ll move on to the exception: when statistical reporting is linked to foul play. Perhaps a country is accused of skewing incomes to receive aid, or of downplaying the social impacts of a political crisis. At the other end of the spectrum, overly critical reporting of statistics can also be problematic. The fact that the social world is complex and difficult to understand means that even quantitative phenomena don’t lend themselves to cheap and fast summaries that are easily reportable. To illustrate this point, let’s look at something that should be relatively easy to count -- namely, money. But it turns out that numbers are soft here too.

Measuring economic activity is hard at the best of times, let alone when faced with the challenges presented in low income countries. Credit: Wikimedia.

On 5 November 2010, Ghana’s Statistical Services announced new and revised GDP estimates. Overnight, the size of the economy was adjusted upward by over 60%, which meant that previous GDP estimates had missed about US$13 billion worth of economic activity. While this change in GDP was exceptionally large, it turned out to be far from an isolated case.

In 2012, I wrote a summary of this situation, explaining in layman's terms how a country like Ghana could go from being so poor one day to an aspiring middle-income economy the next. My intent with the piece was to demystify the process and to lay bare the basic discrepancies between global standards of measurement and local challenges of data availability and resources. The simple fact is that it is very demanding and costly to measure a country's whole economy, particularly in low income countries where very few businesses and individuals report taxes, and only a minority of economic transactions are recorded, with most taking place in the informal, unrecorded economy. Yet, despite these nuances, when the Guardian reprinted the story, they slapped the headline 'Lies, damn lies and GDP' on it. As anyone who has read that piece, or my book, will know, I go to some lengths to dispel the belief that there was a hidden political agenda behind this revision.

While critical skills often fail when it comes to converting complex realities into simple numbers, data users also maintain a reflexive criticism of any 'official number', based simply on distrusting some states and trusting others. A similarly misguided gut reaction exists among scholars, who may never trust a number from Sudan, Ghana, or South Africa, but would not hesitate to use the same number if the World Bank recycled it.

Case study: Promoting trust through fact-checking national statistics

By Kate Wilkinson, Africa Check

In 2018, Africa Check investigated claims by US-based news website Quartz that much of Kenya’s borrowing in recent years has originated from China. Further, they reported that the country’s obligations to Beijing run ‘much deeper than many ordinary Kenyans realise’, under the headline: ‘China now owns more than 70% of Kenya’s external debt’.

The size of Kenya’s public debt was a campaign issue in the country’s 2017 elections, and this debt is sometimes conflated with China’s role in Kenya. China financed the standard gauge railway project, the most visible of the government’s economic growth projects, yet the project’s full cost remains unclear because the agreement is confidential.

Inaccurate reporting on the issue breeds mistrust in the Kenyan government and its official treasury data.

The first step in fact-checking this claim was to find out what information it was based on. Quartz said that the source of the information was an article in the Nairobi-based Business Daily newspaper. That article had relied on information from the 2018/19 Kenyan budget statement which showed that Kenya owed China KSh534.1 billion. This, they said, was 72% of the country’s total bilateral debt of KSh741 billion.

But while it may be true for bilateral debt, it’s not true for all of Kenya’s foreign debt.

Next, Africa Check sought out experts to explain and unpack the numbers.

“All bilateral debt [is] external debt but not all external debt [is] bilateral debt,” said Odongo Kodongo, a financial economist and associate professor at Wits University’s business school.

External debt is the total public and private debt that a country owes foreign creditors. It includes multilateral debt, commercial debt, bilateral debt, and guaranteed debt.

Experts used Kenyan treasury documents to estimate that China’s share of Kenya’s external debt was KSh534.07 billion of KSh2.51 trillion as of 31 March 2018. This was equal to 21.3% -- not 70%.
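The fact-check's arithmetic can be reproduced directly from the figures quoted above:

    china_debt     = 534.07   # KSh billion owed to China
    bilateral_debt = 741.0    # KSh billion, total bilateral debt
    external_debt  = 2510.0   # KSh billion (KSh2.51 trillion), total external debt

    print(f"Share of bilateral debt: {china_debt / bilateral_debt:.1%}")  # ~72%
    print(f"Share of external debt:  {china_debt / external_debt:.1%}")  # ~21.3%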

China’s role in Kenya is a controversial issue. This fact-check showed the public that official government data could be relied upon to determine if claims about the countries’ relations were true.

The failure to reflect on the softness of statistics in general, and instead to look to a higher authority, was reflected in another episode a few years later. On 7 April 2014, Nigeria's National Bureau of Statistics (NBS) declared that its GDP estimates were also being revised upward, to $510 billion -- an 89% increase on the old estimate. It was controversial: the GDP revision was bigger than expected, and the IMF's Statistics Department, which had provided technical assistance for the revision, viewed it as incomplete. But the IMF has no authority to endorse or reject statistics, so it couldn't stand in the way when Nigeria wanted to release the new numbers.

GDP information is released by the NBS in quarterly reports. This infographic from the Q4 and 2018 full year report shows a substantial increase in the 2014 growth rate.

When Yemi Kale, the director of the NBS, announced the new numbers, he was keen to make sure that the credibility of the statistics was not undermined, well aware that the international media's trust was low. This meant the IMF Mission Chief to Nigeria was well prepared for the inevitable question of whether the Fund 'endorsed' the new numbers. When asked, he made a long statement that ended by summarising how the IMF supported "the efforts being made to improve the statistics of the Federation, as a basis of sound decision making. Let me state that we endorse this wholeheartedly and will support Nigeria in this regard". In the end, this statement was reported as 'the IMF endorsed the numbers', despite the broader message of statistical capacity building that it contained.

So, what would be a more conscientious way to report on this number? Rather than focusing on narratives of trust or distrust around the revision itself, possible stories could’ve looked at:

  • The challenges of measuring economic activity in low income countries
  • Resourcing limitations at the NBS, which meant benchmark data for the GDP hadn’t been updated properly for more than a quarter of a century
  • The role of the IMF, and its inability to endorse a country’s data.

Conclusion

By digging deeper, and looking beyond simplistic numbers, journalists are able to provide more accurate and comprehensive reports on social issues. When confronted with a number on corruption, unemployment, or the size of the illegal economy, journalists should always think critically about where that number came from. Numbers are much softer than we would like to think, and we trust and mistrust them far more than their accuracy justifies.

7 countries, 9 teachers: a dossier of data journalism teaching strategies https://datajournalism.com/read/longreads/a-dossier-of-data-journalism-teaching-strategies Thu, 01 Aug 2019 12:15:00 +0200 Nouha Belaid Anastasia Valeeva Bahareh Heravi Jeff Kelly Lowenstein Kayt Davies Roselyn Du, Adrian Pino, Eduard Martín Borregon, Soledad Arreguez https://datajournalism.com/read/longreads/a-dossier-of-data-journalism-teaching-strategies It doesn’t matter which country you’re in, or what university you visit, there’s a common refrain that you’ll hear in the halls of J-schools across the globe: “I’m not good at math”.

Of course, this aversion often leads to a self-fulfilling prophecy. If students don't think they're good at numbers, they avoid them altogether. Yet, to find and accurately report on stories in today's data-intensive society, journalists need these skills.

This calls for a reconsideration of journalism programmes around the world. Teachers need to ask themselves: What are the most relevant strategies to equip students with the skills required for finding facts in datasets? To understand them? To scrutinise them? And to communicate them to the public in the most appropriate and understandable manner?

There’s a lot to be learnt from teachers of all backgrounds. So, in this Long Read, we’ve curated a dossier of reflections from nine educators in seven countries to start understanding the most effective ways to introduce students to data.

Dr Kayt Davies, Edith Cowan University

Awareness that data journalism is a serious and valuable part of contemporary journalism has well and truly dawned in Australia’s higher education sector. What follows is the slower induction of the key concepts into the curriculum of everyday journalism education. And this is no small thing.

Like many journalism educators, I was initially employed by a university because I had worked as a journalist for many years and could bring real-world experiences into the classroom. But data journalism presents a bundle of new challenges. I don't have lived experience of being a data journalist to draw on, and neither do my colleagues.

My way forward was research. I wrote an academic paper based on interviews with 35 other Australian journalism academics, at 25 universities, about what they were doing to address the problem. In brief, I found that I was not alone. The others agreed (almost unanimously) that upskilling was an issue. Very few could access funding to bring in experts to teach students or support staff. Most were trying to cram their own upskilling into already overflowing workloads, and puzzling over what specifically to dedicate their limited upskilling time to. Should it be Tableau, or more Excel, or coding, or testing data scrapers, or data visualisation tools?

The next ugliest problem was squeezing it into crammed curricula, already bursting at the seams because of pressure to produce graduates who can work on every platform.

Then there are the students themselves. They present with a mix of skill levels, which is tricky enough, but in this case the mix is due largely to a fascinating phenomenon called ‘math aversion’.

In my own data journalism course, I tackle this issue in my first class. We talk about it, confess to having it and make a pact to deal with it as if we are in recovery mode, feeling the fear and doing it anyway. I took heed of what others in my study said, in comments like: “I map everything step-by-step, so they don’t get frightened” and “we have to find workarounds like online percentage converters”. It’s an approach that, according to the student feedback at the end of semester, is working -- although it means that our exploration of the intricacies of cutting edge data journalism is minimal for now. But we are laying the groundwork, and by tackling the fears we are setting people up for lifetimes of learning.

Other solutions discussed in that paper included using blended learning approaches, such as requiring students to complete Excel Courses, and recruiting specialist trainers to teach students, or to upskill staff.

Since that paper, other researchers have documented additional strategies to build the data capacity of students. For example, the University of New South Wales introduced a data interrogation activity into a first year unit, because thinking about data “from the beginning reinforces for students that it is a core skill, just like interviewing or checking facts”.

The University of Canberra also introduced a digital campaigning unit that is relevant to students in a range of communications fields, and includes the quantitative literacy and data visualisation skills needed to master data journalism. In describing the program, Glen Fuller explained that the context of 'changing the world' via campaigning injected passion into what students could otherwise frame as dull work.

At Swinburne University, they've taken a silo-breaking approach and employed a data analytics expert to teach data journalism to students, having considered the difficulties of upskilling existing journalism teaching staff.

All of these approaches are experiments and the results of them are being shared at conferences, in academic journals (such as this special edition of Asia Pacific Media Educator), and in books like News, Numbers and Public Opinion in a Data Driven World. Progress in academia can be slow (because everyone is busy) but this ball is rolling now and a revolution in how we think about the role of data in journalism education is underway.

Nouha Belaid, Central University of Tunisia

All over the world, and even in Tunisia, media landscapes have developed to the point that the job market today requires technical skills, as well as linguistic skills, to carry out real journalistic work.

In 2017, through an online survey we conducted with Tunisian journalists, working across newspapers, radio, and television, we demonstrated the value of technical skills for new graduates. Unsurprisingly, our survey found that 33% of journalists identified writing skills as important -- yet, almost the same number, at 32%, highlighted the importance of technical skills.

In Tunisia, discussions around the teaching of journalism have often focused on the linguistic practice of journalism. The technical dimension, which has disrupted the job market, has not been fully considered, although some universities have added standalone modules into their journalism education programs. But we have yet to see any in-depth change.

Even so, over these last seven years, we’ve seen more conversations about open data, open source, big data, data mining, data warehouses, and the potential for their use in journalism. In academia, we have really started talking about databases in Tunisia, particularly in the fields of computer science and statistics. This means that most of the movement has been in engineering schools -- two universities offer a Masters in Big Data -- and, while there is one public journalism school and three private ones, none of them offers a degree in data journalism, or even a separate subject on it.

As a result, the teaching of data journalism has always come down to individual initiative, on the part of a professor who usually teaches web writing or editing, online journalism, or similar. For my own part, this was how I introduced my third year web journalism students to the field. Here, my students learnt how to turn data into graphs:

Work produced by Nouha’s web journalism students.

There has also been movement since the Arab Spring, when academic staff at the Higher Institute of Arts and Multimedia of Manouba launched a Masters in Media Engineering. This degree offers three specialties: web development, 3D visualisation, and community management. Students learn how to develop data stories, to manage data, and even to write, by taking courses in data visualisation and data information, as well as more generalist courses, such as online editing, which include data skills.

Despite these developments, for many journalism schools, integrating specialised coursework in data and computation presents something of a chicken-and-egg conundrum. Without a data education, there are few educators to teach data. To help, I’ve developed the following steps for bringing data into a curriculum:

  1. While schools may wish to prepare their graduates for this emerging field, the field itself may not yet have enough teachers in its ranks. So, try scheduling guest lectures as a transitional solution.
  2. New tools are developing quickly, and it is critical for the faculty to continue to grow, learn, and change as the field itself develops. Everyone has to follow the new wave.
  3. Many universities provide computer labs and studios for classes. The primary advantage is the certainty that each student will have a workstation with the necessary technical specifications and software installed. The primary disadvantage is that students may graduate without access to these tools at home, which they need in order to practice the skills that they have learnt.
  4. Using MOOCs in a complementary fashion with university data journalism courses could help professors integrate new skills into their offerings.
  5. Journalism schools should build collaborative partnerships with other disciplines.

Journalism is not a narrow set of traditional newsroom skills, but instead encompasses whatever tools and methods have, in one way or another, been made journalistic.

Several journalism schools around the world have begun building bridges with computer science departments by opening research centers, co-teaching and cross-listing classes, and even developing joint degree programs. This should be the way forward in Tunisia, where expertise is siloed in our journalism schools and multimedia institute. Instead, let’s gather experiences and build a comprehensive approach to teaching the country’s future data journalists.

Dr Roselyn Du, Hong Kong Baptist University

New digital possibilities, such as the plethora of online information, the popularity of social media, and the circulation of do-it-yourself news, have threatened professional journalism’s monopoly on information. In an age when almost every ordinary member of the public can use a smartphone to take photos, videos, and publish news to social media platforms, data journalism is one crucial way to make professional journalists distinguishable.

In Hong Kong, which boasts the most diversified, competitive, and democratic media market in the Asia-Pacific region, skills in data driven reporting are highly valued by middle-level journalists. Irene Jay Liu, one of Hong Kong’s public faces for data journalism, noted in a 2013 interview with the Pacific Media Center (she was then news editor for data at the Thomson Reuters Hong Kong division) that data journalism might be a solution to the crisis in journalism, including layoffs in the newsroom, the closure of local and regional newspapers, and declining circulation brought about by media convergence. While the digital revolution has made it possible for the ordinary public to find, produce, and distribute information they previously relied on journalists for, data journalism, with its presumed sophistication, can be regarded as one crucial saving grace for professional journalism, and a way for the mainstream media to justify their continued existence.

However, professional data journalists are still rare in Hong Kong’s media industry. This may be due to the fact that journalism programs here have a long history of predominantly focusing on conventional skill sets (writing, interviewing, news judgement) and overlooking the necessity of equipping journalism students with data and computational skills.

While newsrooms are making tremendous efforts in using data to produce news, data training in the higher education sector has lagged behind. Whether journalism programs should emphasise data skills in training future journalists has long been a subject for debate. Adding to this, teaching journalism students (who are generally afraid of numbers) data skills and convincing them to build a data frame of mind has proven to be a thorny task. At times, it is a big challenge, if not a battle.

Roselyn’s class at Hong Kong Baptist University.

To find out more, we conducted a study of 121 journalism students at Hong Kong Baptist University, which has a top journalism program in Asia. Our survey, combined with in-depth interviews, generated three major findings for journalism educators and future journalists:

  1. While journalism students are eager to understand what data journalism is and how it’s practiced, they do not have comprehensive knowledge of data collection, data analysis, and interpretation.
  2. Computational tools are absent from current journalism curricula, which leads to students’ misperception about data usage in news reporting.
  3. While students are highly willing to learn data journalism, about half of those surveyed expressed a dislike of data work.

Interestingly, we found that gender and major also played a role in our students’ perceptions of data journalism. Male students had mastered more data-related knowledge than their female counterparts. Those majoring in Chinese journalism showed the least interest in data, compared to our financial journalism majors, who seemed much more comfortable with it -- among the Chinese journalism majors, only 20% identified themselves as having an interest in data, compared to 75% of their financial journalism counterparts. International journalism majors lay in between.

The overall irony that emerged from our study is this: although our students recognised that knowing about data journalism is a must for their career development and will be advantageous in their professional practice, many showed minimal interest or possessed minimal data skills. The reason for this gap lies in the lack of support and insufficient training in data by journalism educators. Consequently, it is hard for students to embrace a comprehensive data skill set, not to mention its integration into journalistic core values. To this end, the foremost priority for journalism programs should be to cultivate a data-friendly culture, that is, for the administration to see data literacy and data skills as an area for investment rather than a cost. Only from there can a data-savvy curriculum be built for our future journalists.

Adrian Pino, Universidad de Concepción del Uruguay

Eduard Martín Borregon, Universidad Iberoamericana de México

Soledad Arreguez, National University of Lomas de Zamora

Although data journalism has been progressively implemented in the study programs of North American and European universities, programs in Latin America have developed slowly and still have little presence.

That said, data journalism training does have some presence in the region’s public and private universities -- for example, Adrián Pino’s program at the University of Concepción del Uruguay (Argentina), where he has incorporated a specific journalism unit dedicated to data. In this unit, Adrián focuses on critical skills, including basic scraping tools, spreadsheets for processing and analysing data, and foundational concepts and tools for data visualisation. Each student also prepares their own data journalism project, independently carrying out all stages of the story production process until it’s published.

Yet, as with many students around the world, embracing calculations and numbers is one of the central challenges for journalism students in Argentina. Often, they find it difficult to focus on statistical rigor, to incorporate basic mathematics, and to understand that data journalism requires these skills. As educators, we’ve found that the best way to counter this hesitancy is to focus on teaching strategies that foster enthusiasm. For example, when going through the entire process of developing a journalistic project based on data, students have shown greater enthusiasm by taking ownership of the task. Similarly, showcasing the power of data visualisation has led them to investigate visualisation tools outside of class.

In addition to standalone university programs, a broader initiative by Datos Concepcion has sought to spread data journalism in higher education institutions across the country. Launched in 2017, the Data Journalism Training Program for Universities began with seven universities and added another four in 2018. The program works as an intensive two-day workshop (10 hours total) where the basics of data journalism are presented to students through practical exercises that cover data downloading, cleaning, processing, and analysis. It also culminates with the generation of a data journalism project, which again has helped participants connect with the team, become enthusiastic about data, and even realise their first data driven news project.

The Datos Concepcion program.

But it is difficult to measure the success of these programs because there have been no studies looking at the status of data journalism education in Latin American universities, or even in the media teams of our region. The absence of impact measurements, which show the capacity for change and the real impact of data journalism teachings, presents a limitation for educators. Moving forward, we believe that developing new impact indicators for data journalism and surveying the state of its instruction would be a great way to develop effective teaching strategies tailored to Latin American students.

For example, Latin America, known for its booming open data scene, should be the perfect staging ground for data journalism education. Many of those who teach data journalism in the region’s universities participate in the open data movement and many of the organisations that push open data and transparency processes have built trainings that could be better used by university programs. Yet, there are questions around how to better leverage these alliances and take advantage of capabilities on both sides.

Although we have academics and researchers who could advance this area, a lack of financing is generally the factor that delays these opportunities. Without these advances, it is clear that challenges will await us in the years to come, requiring articulated and shared efforts among the educational organisations across Latin America.

Jeff Kelly Lowenstein, Grand Valley State University

Teaching data journalism integrates two of the deepest passions in my life: teaching and working with data.

I learnt to teach from my fourth grade teacher during a two-year apprenticeship that began after my parents were involved in a near-fatal car accident in the mid-80s. My passion for data followed almost twenty years later, after taking a class with Mike Berens, a future Pulitzer Prize winner, at Northwestern University’s Medill School in the spring of 2003. Mike showed us the power data analysis has to generate findings and illuminate systemic patterns of abuse -- lessons I’ve applied in my own work and teaching.

I’ve taught students, interns, and professionals across South America, New Zealand, South Africa, and the United States, for anywhere from a single hour-long session to a semester-long course. It’s hard work, and I’ve not gained full mastery. While the daily work of teaching is an enormously joyful experience and I’m grateful for my students’ efforts and progress, I often arrive at the end of the semester feeling drained, and slightly dissatisfied. There is always a gap between what I know is possible and how far the students have gone, and I’m all too aware of the many ways I could have taught better. At the university level, where I’ve worked for the past five years, I’ve wrestled with the challenge of how to teach my students skills, integrate them into the global community of people doing data journalism, and carry out a meaningful project in 15 weeks, amid all of the competing demands of work, other classes, internships, family, and social life.

At the same time, thinking back to my fourth grade teacher, and hearing from students I taught more than two decades ago, helps me remember that any teaching is an entry point into a subject’s universal principles and a student’s long-term connection with them. The spark that I have lit with these students helps me keep going through the inevitable struggles I face with the current group. I also draw solace from the knowledge that I’ll have another chance in the following semester, as well as from the humbling awareness that, despite our greatest efforts, we never know what our students will take from their time with us.

Even with this uncertainty and continual desire to improve, I’ve developed certain approaches that guide me through each semester of teaching data journalism.

Here are six learnings that ground me and my students:

1. Investigative Reporters and Editors (IRE) Contest entries are your friend

Reading these entries teaches students how to reverse engineer an investigation, albeit with some chest thumping and the larger objective of winning a coveted IRE award. By examining them, and then reaching out to their respective reporters, I’ve found that students start to both learn how an investigation is done and to join the global community of people doing that work.

2. Build incrementally

It helps to start with smaller stories and projects that become the basis for larger, more ambitious ones. I taught a class one semester where I had the students go through the process twice. The first story was with a dataset I provided, while the second was with either an additional one related to the first, or one that they had selected. Building muscle memory through repetition and leading into increased complexity is a positive way to go.

3. Emphasise the importance of failure

It’s hard to overstate how critical it is for students to feel comfortable embracing failure as a necessary part of growth, particularly in data journalism. Often, we not only learn more from our mistakes; they are also where the gains come, as we strive and reach for different elements.

This has happened to me numerous times, but one that comes to mind is a lesson on the importance of the margin of error. The day before a major public presentation, I learnt that, during the fact-checking process, an analysis that I thought showed major disparities between neighborhoods turned out to be completely within the margin of error. My editor and I scrapped the entire analysis, regrouped over some Chinese food, and launched into a frenzied effort to come up with a new angle. We eventually found one, but only after staying at the office until the early hours of the morning. You can be assured that I check the margin of error every time now!

4. Provide a lot of opportunities for students to get comfortable as a first step towards developing expertise

Some students are very comfortable with digging into and splashing around in the sea of data. Others like to do visualisations. Others feel maps are their thing. Still others excel at moving from working with data to taking their findings into the real world. Offering all kinds of learning options provides more opportunities for students to find their groove and to understand how different storytelling pieces connect to each other.

5. Encourage students to be patient and enjoy the hunt

Going ‘down the rabbit hole’ is a phrase that often has negative connotations of losing the point and wasting time. While that perspective has some merit, there is value, too, in thinking about and exploring all kinds of questions before deciding on a specific course.

6. Remember that the data is never the story

I once said to my wife, “Honey, it’s Saturday night. I’m here with you. I’ve got a tasty glass of wine and a fresh new dataset. Can life get any better?” (“I certainly hope so,” she answered without a second’s hesitation.) As much as I love data, I’ve come to understand more and more that many of the most powerful stories come from weaving together findings with stories of individuals and communities whose experiences reflect the impact of a policy or phenomena.

Although arduous and at times frustrating, working with data and teaching data journalism are tremendously rewarding parts of my life. The ability to make a real impact through the data driven illumination of abuse and oppression is a deeply meaningful experience; to work with young people as they begin their journeys to do the same is even more so.

Anastasia Valeeva, American University of Central Asia

Kyrgyzstan is often praised in the international media as the only democracy in Central Asia. Despite being a developing country, where bride kidnapping remains a phenomenon and equal access to education and healthcare is yet to be achieved, it is indeed outstanding in terms of media pluralism and freedom of expression. Digital-wise, though, few media in the region can live up to international standards, although quite a few organisations are focused on building their capacity.

It’s into this environment that we first introduced Kyrgyz students to data reporting, with the help of UNDP at the Data Journalism Summer Institute in 2017. Following the success of this programme, I decided to stay on to build a formal classroom framework at the American University of Central Asia in Bishkek.

Although ambitious, the plan agreed with the department leadership was to divide the necessary theory and practice into five courses and offer them continuously at the Bachelor level over the course of 2.5 years -- from an introduction to data journalism to data analysis, storytelling, and visualisation. We wanted to promote a holistic understanding of the various processes and skills involved in the production of solid data journalism work.

This change turned out to be easier to do in practice than to implement on paper. Even though I was invited to lead the university’s data journalism concentration, some of our classes sat under pre-existing courses such as ‘Newswriting’ and ‘Advanced Reporting’. Others, like ‘Data Storytelling and Data Visualisation’, were offered as electives, and most of our students were eager to finish the concentration. From my limited observation of two student groups so far, data journalism is not perceived as rocket science but rather as something cool. Even pivot tables -- sometimes, especially the pivot tables.

Anastasia with her students at the American University of Central Asia.

But, in developing the data journalism concentration, we tried to change the set of required courses in the journalism schedule, which presented a unique challenge: these conflicted with the State’s standards for journalism education. And so, because it is becoming clear that a knowledge of basic Excel is as much of a must for journalists as grasping basic grammar and document editing software, we have to offer many of our data journalism courses as electives to complement the State-approved curriculum. By allowing this combination of required and elective courses, we hope that every future journalist in Kyrgyzstan will be at least data literate, with the opportunity to learn more advanced skills.

This is how data journalism courses can spread from the American University of Central Asia, which is often a pioneer and innovator in the country’s higher education system, to other journalism schools constrained by limited budgets and the skills of their staff. Yet, to facilitate this growth, we need proper research into how journalism education itself should be updated to equip future graduates with the skills for the market in Kyrgyzstan.

In addition to building a comprehensive course, I felt the need to look beyond our cosy classroom and consider the broader data journalism environment. Where will my students publish their first data journalism projects? Who will have vacancies for them once they graduate? Are there enough designers, developers, and other professionals who understand the principles of data storytelling and would be able to work with them on their projects?

With this in mind, I started working in the other direction, too. Together with an alumna of the 2017 Data Journalism Summer Institute, I co-founded the School of Data Kyrgyzstan, a chapter of the global network, aimed at promoting data literacy across Central Asia. To truly prepare students, we run hackathons, where students work alongside journalists, designers, and developers on a data story; offer internships and jobs to help students build experience; and provide them with the opportunity to step up as assistant trainers at our workshops.

We’ve also taken some steps to make data journalism education more available and accessible to other students. For example, we created an online course on data communication, available for free here in both the Russian and Kyrgyz languages. This is already being used by the teaching staff at Osh State University as a blended learning component. And, to really help spread data expertise, we’ve also planned for a two-week retraining programme for journalism professors from all over the country. Thanks to various grants, this is entirely possible to implement and its impact on Kyrgyz data journalism is only a question of time.

To sum up, two years into teaching data journalism at the university level in Kyrgyzstan, there are still lots of things to improve. We are yet to update the State’s standards and the required courses for university-level data journalism -- which will be a long process. However, our experience has shown that this challenge can be overcome by revisiting the design of the concentration and seeking better cooperation with other departments such as sociology, statistics, or IT; involving actual journalists in the learning process, in the format of production labs; or even experimenting with the spectrum of formal and informal education to find an ideal balance. So, while I’ve been working to teach data to my Kyrgyz journalism students, my work at the university has also taught me to accept things in transition.

Bahareh Heravi, University College Dublin

Ireland is a small country with a thriving scene of tech and data startups, and it is home to the European (or EMEA) headquarters of large international data-centric organisations, such as Google and Facebook. Despite this strong data scene, and the prevalence of data analytics skills in the country, Ireland has had a slow uptake of data journalism, particularly in comparison to some other European countries, like the UK or Germany. That said, over the past five years, there has been an increased interest in embracing data journalism in Irish newsrooms. Yet, without a historical demand for data journalism, newsrooms have only had a small number of data-savvy journalists to pick from.

In a personal attempt to remedy this, I started Ireland’s first postgraduate data journalism module for the Journalism MA programme at the National University of Ireland in Galway in September 2015. Later that year, I was asked to teach a Dublin City University undergraduate module on data journalism too.

In 2016, I joined the School of Information and Communication Studies at University College Dublin (UCD), which gave me the opportunity to start Ireland’s first dedicated data journalism programme. Gearing up to design this new programme, I -- of course -- approached the problem as most academics would: by studying the field.

I first studied the state of data journalism practice and data journalism educational needs globally through a survey I ran in collaboration with Mirko Lorenz of Datawrapper. I then studied and analysed around 220 data journalism related courses I could find across the globe. Following these studies and a feasibility assessment locally, we decided that a part time postgraduate certification programme aimed at professional journalists and journalism graduates was our best way forward at UCD, and in Ireland. Consequently, we designed and launched the UCD Data Journalism ProfCert programme for a September 2017 intake.

As part of the UCD programme, our students are introduced to a variety of data journalism techniques and the tools needed to complete a data journalism project lifecycle, as well as being trained in quantitative data analysis, statistics, and R. In the second semester, we run a data journalism studio, which takes students through the production of data driven stories, maintained on a fully-fledged data journalism publication website.

UCD students publish their stories on this dedicated website.

One of the key challenges, for any journalism educator, is to prepare students for real-life storytelling. So, through the data journalism studio, we partner with Irish news organisations for co-publications of a selection of our students’ data stories. Examples of such co-publications are RTÉ’s investigations on ‘Personal Injury Claims in Ireland’ [UCD version, RTÉ version], and ‘Most dangerous cities for Gardai’ [UCD version, RTÉ version], as well as the Irish Independent stories on ‘Hottest travel destinations for Irish sun-seekers’ [UCD version, Irish Independent version] and ‘Patients on trolleys in Irish hospitals’ [UCD version, Irish Independent version]. These co-publication initiatives have paved a way for exciting collaborative opportunities between the programme and news media organisations, while adding a real, tangible industry-focused aspect to the programme.

Since 2014, I have trained around 100 data journalists in Ireland. Despite this figure, and the good quality of work produced by my students, positions hiring specifically for ‘data journalists’ are still very rare to come by. My hope is to train as many data journalists as possible -- and, as the number of skilled graduates grows, that they themselves will change and re-shape the data journalism landscape in Ireland, and beyond.


Let’s get physical: how to represent data through touch https://datajournalism.com/read/longreads/lets-get-physical-how-to-represent-data-through-touch Tue, 16 Jul 2019 00:30:00 +0200 Alice Corona https://datajournalism.com/read/longreads/lets-get-physical-how-to-represent-data-through-touch Data visualisation has become a natural companion for journalists reporting complex data stories in both print and digital formats. But visual cues are just one of many possible ways to encode data, and humans have been embedding data into the properties of physical objects for millennia (think of the Peruvian quipus).

Despite this ancient history, the term ‘data physicalisation’ only appeared in the academic literature recently, in a 2015 paper by Yvonne Jansen, Pierre Dragicevic, and others:

A data physicalization (or simply physicalization) is a physical artifact whose geometry or material properties encode data.

What is data physicalisation?

Depending on its actual form, a data physicalisation can be seen, touched, heard, tasted, smelled, and more. Other than the artefact itself, the term also refers to the process of transforming data values into physical properties. When designing a data physicalisation, data can be used to shape the geometry of a physical object, much like how lengths, angles, and slopes are used to encode data in visualisations. Additionally, data can be encoded in the material itself and its characteristics. Texture, consistency, coldness, or weight can all be used both to encode data and to set a specific mood and emotional relationship with the object, including by leveraging existing cultural meanings associated with the material.

Touching Air by Stefanie Posavec and Miriam Quick. Credit: List of physical visualisations.

Physical data objects can be more inclusive and intuitive than purely visual ones, and not just because they are accessible to blind people or because they have a lower tech barrier. After all, long before swipes and clicks, humans have thousands of years of experience in figuring out how to interact with the analog world through an almost infinite range of bodily motions and hand gestures.

Illustrations of the exploratory procedures and their associated material and object properties, adapted and redrawn based on Lederman and Klatzky’s 2009 research, by Simon Stusak, Exploring the Potential of Physical Visualizations, in the 2015 Proceedings of the Ninth International Conference on Tangible, Embedded, and Embodied Interaction (TEI '15).

The way that we interact with an object, and the physical gestures required to do so, play a key role in shaping our experience of that object and our future modes of interaction. Studies have concluded that being able to handle and manually interact with an object aids cognition, for example in the case of letting children play with physical letters to learn the alphabet. Research specifically focused on data is still relatively new, although there are a few excellent studies in which scholars have experimented to see if this argument applies to tangible data representations. Is our memory enhanced when multiple senses work on a cognitive task? Can we remember more if data is represented on a 3D physical bar chart that we can see, caress, and touch, rather than a 2D digital bar chart? Can we retrieve information from a data physicalisation more easily than we can from a digital visualisation?

A journalistic context for data physicalisation

The potential journalistic benefits of physical data representations are not limited to how such objects perform in terms of information retrieval and memorability. Physical news installations inherently have some characteristics that make them a great complement to online journalism. For example, they can be used to deliver interactive data journalism in places with limited internet access. They can be designed to be fully accessible for the visually impaired, unlike online data viz. They are also more straightforward to interact with than rich online interactives, and can therefore be used to teach and improve the audience’s data and visual literacies. (Think: Hans Rosling’s Population growth explained with IKEA boxes.)

Furthermore, physical data installations can have a role in fostering civic engagement, as Attila Bujdosó describes in his 2012 essay, Data embodiment as a strategy for citizen engagement:

“...public data embodiment objects promise the potential of generating collective experiences in a community, which can strengthen the shared identity among its members and develop a group responsibility for collective issues.”

There is even a specific practice of data physicalisation, called ‘Participatory Data Physicalisation’, that explicitly highlights this focus on the collective experience. In such works, the data physicalisation is not an object designed top-down, but rather, in the words of Matteo Moretti, a “shared experience, where visitors [become] participants: protagonists and recipients of the visualization”.

Participatory Data physicalisation @ TED Med 2017 Milan, by Know and belive, the Free University of Bozen-Bolzano, and Matteo Moretti.

The fact that a physical data representation does not need to be opened in a browser, but is somehow ‘always there’, means that it can promote an informal and casual experience, which is more inviting to novice and non-expert users. Newsrooms could coordinate with public spaces (like parks, libraries, or museums) to host events and news installations on topics of local relevance, like the local administration’s spending budget or traffic accidents. Such installations would continue to inhabit the place for defined periods of time, offering a constant visual token as people walk by it, perhaps spontaneously engaging with it and debating its message with other members of the community.

Developments in newsroom business models suggest that this path is compatible with what is working in terms of revenue. As newsrooms rely less and less on ads and more on membership models, subscriptions, and donations, there will be more room for experiments in ways to better engage the audience.

In parallel, newsrooms are increasingly grounding their practice in the community they work with. According to the Engaged Journalism Accelerator, which promotes similar initiatives, these kinds of journalistic practices have “the potential to restore trust in media, provide citizens with information they need and help establish new and resilient revenue models and enhance plurality and diversity in a crucial part of society’s information ecosystem”.

Lastly, journalism events are becoming more common and are a perfect match for physical data installations. NiemanLab’s research across Western media outlets puts physical journalism, “in the form of public meetings, festivals, events, and stage plays”, among the nine core ideas for innovative journalism. In the USA, The Washington Post, Politico, National Journal, the New York Times, NPR and others are all hosting live journalism events. The Texas Tribune generates almost as much income from events as from corporate gifts, while The Atlantic makes about 20% of its revenue from the roughly 100 live events it organises per year.

In short: it seems like a perfect time for newsrooms to experiment with data physicalisation. But where to start? The following section provides a quick primer on how to go from a .CSV or a .GEOJSON file to a 3D tactile map, all while learning the common tools and stages of a data physicalisation process using digital fabrication.

Crafting physical data objects

First, let’s look at the main steps that will allow you to transform a data file into a 3D map. As an example, we’re going to use data from the website Inside Airbnb, an independent watchdog that collects data to help researchers assess the impact of Airbnb on their city’s housing stock. We’ll focus on two European cities that symbolically sit at opposite ends of the spectrum in terms of regulation: Barcelona and Venice. On one hand, Barcelona can be seen at the frontlines of restricting touristic short-term rentals through a system of licenses and quotas, and via a direct agreement with Airbnb to access hosts’ data. On the other, Venice has very limited regulation, mostly just on tax issues. To compare these approaches, we’re going to physicalise the number of entire apartment listings in a hexagon grid map. Once you familiarise yourself with this process, it can be repeated for any of the 70+ cities for which you find data on the Inside Airbnb website.

In this tutorial, we’re going to design the maps separately, so that they can be presented independently or next to each other like a series of small multiples maps. The design and fabrication of these maps has been developed by Chi ha ucciso il Conte?, an Italian designer and expert in digital fabrication and open source software.

Digital fabrication

To craft our data driven objects, we’re going to use digital fabrication techniques. Digital fabrication is a manufacturing process that uses the numerical values generated by computer software as coordinates to control machines and create objects with an accuracy that would be hardly achievable by a human being. Typical machines are 3D printers and CNC milling machines.

Not all data physicalisations are made through these kinds of semi-automated processes. For example, you could represent data through existing physical objects, or you could hand-craft data objects out of Play Doh or Lego, or even by knitting. However, digital fabrication technologies have a series of advantages that make them a good match for our project:

  • they allow for rapid manufacturing of the desired outputs
  • their high level of accuracy makes them perfect to work with data
  • the same design file can be repeatedly used to create reliable and consistent outputs anywhere in the world, meaning that an unlimited number of news organisations (or local units of the same news organisations) could manufacture an identical output locally, using the original design file.

A typical digital fabrication workflow starts right after the designer has clearly established the look, structure, and functioning of the desired output with pen and paper, including the definition of more technical characteristics like the dimensions of all its various parts.

An example of a digital fabrication workflow.

The first step of a typical digital fabrication workflow is to transform these sketches and drawings into a digital, mathematical, 3D model of all the components that need to be produced for the final output. This phase could encompass things like:

  • translating the data points in a spreadsheet into visual elements such as the lines, curves, lengths, and widths of the physical object that will go into production
  • modeling the physical components needed to host sensors or motors to record and animate the data
  • building physical boards with engraved information about the data, axis ticks, labels, annotations, titles, and so on.

Once the 3D modeling phase is complete, this visual and human-friendly design needs to be converted into a file with machine-readable instructions, so that a manufacturing machine can accurately and reliably craft the designed objects. This procedure is called G-Code generation. The G-Code file is the one that eventually gets imported into the machine.

The workflow

Now to what happens at each stage of the data physicalisation process. We’ll just be covering high-level instructions, focused on the input/output of each phase and what each intermediary step should be, rather than the details of each software’s interface or how to practically execute the commands (although we will provide links on where to find additional resources for each). If you have specific questions, feel free to contact the author.

For this example, we will only use a 3D printer, as it is the most approachable of the possible machines to build a 3D hexagonal grid map, along with entirely open source software: QGIS for preparing the data, FreeCAD and Blender (with the Blender GIS add-on) for 3D modeling, and Slic3r for generating the G-Code.

Our workflow will show you how to create a single map for the city of Barcelona. Creating the second map for the city of Venice (or any other city) will require the same steps, just with different input data. I have imported all of the output files mentioned at the different stages into this GitHub repo. Feel free to download them and use them for reference, or to directly print your own versions of these two maps, skipping the preparation and modeling phase.

1. Prepare the data

Software: QGIS

Input: Two files downloaded from the Inside Airbnb website: the ‘neighbourhood.geojson’ file and the ‘listings.csv’ file for the city of Barcelona.

Output: Two files for each city:

  1. One .SVG file with the outer boundaries of Barcelona. We’ll call it ‘barcelona-boundaries.svg’.
  2. One .SHP file with a hexagonal grid vector layer, containing the number of entire apartments in each hexagonal area. There should be a minimal buffer between the hexagons, and the grid should include only areas with one or more apartments. We’ll call it ‘barcelona-airbnb.shp’.

To create the .SVG file:

  1. Open QGIS and import the ‘neighbourhood.geojson’ file.
  2. Apply the dissolve command on the layer to delete all inner boundaries dividing the different neighbourhoods, resulting in a vector layer containing only the outer boundaries of the city.
  3. Use QGIS’s print composer to export the .SVG of this simplified layer. Let’s call this file ‘barcelona-boundaries.svg’.

To create the .SHP file (a scripted alternative is sketched after these steps):

  1. Import the ‘listings.csv’ file in QGIS.
  2. Filter the ‘listings’ to include only entire apartments.
  3. Create a regular grid vector layer of 500x500m hexagons.
  4. Clip the hexagon grid with the dissolved neighbourhood layer.
  5. Count the number of points in the filtered listings layer belonging to each hexagon of the grid.
  6. Add a minimal negative buffer (-0.01 meters) to make sure the hexagons of the grid don’t touch each other. This is important because the 3D modeling software you will use needs to work with distinct solids without any overlapping lines, and in the default hexagonal grid created by QGIS, the polygons overlap. Giving the buffer zone a very minimal number removes any overlapping while creating a gap that is essentially imperceptible to the eye.
  7. Filter the buffered layer so that it includes only hexagons for areas with at least one entire apartment within it.
  8. Export the layer as a .SHP file, with the CRS set to either WGS84 or Web Mercator. Let’s call this file ‘barcelona-airbnb.shp’.
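
If you’d rather script this phase than click through QGIS, the same preparation can be reproduced in Python. The sketch below uses geopandas and shapely -- not part of this tutorial’s QGIS toolchain, so treat it as an illustrative alternative. It assumes geopandas 0.10+ (for the ‘predicate’ argument) and the standard Inside Airbnb file and column names (‘room_type’, ‘longitude’, ‘latitude’), and mirrors steps 1-8 above, with hexagons measuring roughly 500m corner to corner.

    import math

    import geopandas as gpd
    import pandas as pd
    from shapely.geometry import Polygon

    # Dissolve the neighbourhoods into the city's outer boundary.
    city = gpd.read_file("neighbourhood.geojson").dissolve()

    # Steps 1-2: load the listings and keep only entire apartments.
    listings = pd.read_csv("listings.csv")
    listings = listings[listings["room_type"] == "Entire home/apt"]
    points = gpd.GeoDataFrame(
        listings,
        geometry=gpd.points_from_xy(listings["longitude"], listings["latitude"]),
        crs="EPSG:4326",
    )

    # Work in metres (Web Mercator) so the hexagon size is true to scale (step 8).
    city_m = city.to_crs(epsg=3857)
    points_m = points.to_crs(epsg=3857)

    def hex_grid(bounds, r):
        """Flat-topped hexagons of circumradius r covering a bounding box (step 3)."""
        minx, miny, maxx, maxy = bounds
        dy = math.sqrt(3) * r                      # vertical spacing between centres
        cells, col, x = [], 0, minx
        while x < maxx + r:
            y = miny - (dy / 2 if col % 2 else 0)  # every other column is offset
            while y < maxy + dy:
                cells.append(Polygon([
                    (x + r * math.cos(math.radians(a)),
                     y + r * math.sin(math.radians(a)))
                    for a in range(0, 360, 60)
                ]))
                y += dy
            x += 1.5 * r                           # horizontal spacing between columns
            col += 1
        return gpd.GeoDataFrame(geometry=cells, crs="EPSG:3857")

    grid = hex_grid(city_m.total_bounds, r=250)    # ~500 m corner to corner

    # Step 4: clip the grid to the dissolved city boundary.
    grid = gpd.clip(grid, city_m)

    # Step 5: count the filtered listings falling inside each hexagon.
    joined = gpd.sjoin(points_m, grid, how="inner", predicate="within")
    grid["apartments"] = joined.groupby("index_right").size()
    grid["apartments"] = grid["apartments"].fillna(0).astype(int)

    # Steps 6-7: shrink each hexagon slightly so the solids stay distinct,
    # then keep only hexagons containing at least one entire apartment.
    grid["geometry"] = grid.geometry.buffer(-0.01)
    grid = grid[grid["apartments"] >= 1]

    # Step 8: export as a shapefile in Web Mercator.
    grid.to_file("barcelona-airbnb.shp")

Either route produces the same ‘barcelona-airbnb.shp’; the scripted version is simply easier to re-run for the other 70+ Inside Airbnb cities.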

2.a. Create a 3D model for the base of the 3D map

Software: FreeCAD

Input: The .SVG file created in the previous phase, ‘barcelona-boundaries.svg’.

Output: Two .STL files, one containing a square base; another with extruded 3D boundaries of the city. We’ll call them ‘barcelona-base.stl’ and ‘barcelona-city-map.stl’. They will be used for reference in the next step.

To create the .STL files (a scripted sketch of the base follows these steps):

  1. Open FreeCAD.
  2. Draw a 20x20cm square. This will serve as a reference point for how big the map should be. The size can depend on the printer you have access to: the bigger the building plate, the bigger you can make the map and its base.
  3. Extrude the square by 5mm.
  4. Export this extruded object as a .STL mesh (called ‘barcelona-base.stl’).
  5. Import the .SVG with the city’s outer boundaries on FreeCAD.
  6. Extrude the .SVG by 2mm.
  7. Export this extruded object as a .STL mesh (let’s call it ‘barcelona-city-map.stl’).
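
For repeatability, steps 1-4 (the 20x20cm base) can also be run from FreeCAD’s built-in Python console. This is a minimal sketch using FreeCAD’s standard Part and Mesh modules; dimensions are in millimetres, FreeCAD’s default unit. The SVG import and 2mm extrusion of the boundaries (steps 5-7) are easier to do through the GUI as described above.

    import FreeCAD
    import Mesh

    # Steps 1-3: create a 20 x 20 cm square and extrude it by 5 mm.
    doc = FreeCAD.newDocument("base")
    box = doc.addObject("Part::Box", "Base")
    box.Length = 200   # 20 cm
    box.Width = 200    # 20 cm
    box.Height = 5     # the 5 mm extrusion
    doc.recompute()

    # Step 4: export the solid as an STL mesh.
    Mesh.export([box], "barcelona-base.stl")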

2.b. Create a 3D model of the hexagonal grid map layer

Software: Blender and the Blender GIS Add-on

Input: The .STL file with the city map created in the previous step and the .SHP file created in the first phase, for both cities. So: ‘barcelona-city-map.stl’ and ‘barcelona-airbnb.shp’.

Output: A .STL file containing a 3D model of the hexagonal maps, with the height of each hexagon defined by the number of entire apartment listings in that area. We’ll call it ‘barcelona-airbnb3d.stl’.

To create the .STL file (a bpy sketch of the final steps follows this list):

  1. Open Blender.
  2. Import first the .STL file with the city map.
  3. Import the .SHP file with the hexagonal grid and the data on entire apartments. You will see this option only if you have installed the Blender GIS Add-on.
  4. In the import options, before clicking ‘OK’, make sure that you specify the following parameters:
    1. Extrusion from fields: tick the check box
    2. Field: choose the field that contains the number of apartments
    3. Extrude along: choose ‘Z axis’
    4. CRS: specify the correct CRS.
  5. The imported .SHP file is probably very big: resize it so that it snaps into the city boundaries of the city map .STL file.
  6. Perform a boolean union operation to unite the base city map to the 3D hexagons into a single solid.
  7. Export the solid as an .STL file. Let’s call the file ‘barcelona-airbnb3d.stl’.
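
If you end up producing maps for several cities, the boolean union and export (steps 6-7) can be scripted in Blender’s Python console. The sketch below assumes Blender 2.8x-3.x, where the bundled STL export operator is still available, and assumes the two imported objects are named after their files -- check the actual names in your scene’s outliner.

    import bpy

    # The object names here are assumptions; adjust them to match your scene.
    base = bpy.data.objects["barcelona-city-map"]
    hexes = bpy.data.objects["barcelona-airbnb"]

    # Step 6: add and apply a boolean modifier uniting the hexagons with the base.
    union = base.modifiers.new(name="Union", type='BOOLEAN')
    union.operation = 'UNION'
    union.object = hexes
    bpy.context.view_layer.objects.active = base
    bpy.ops.object.modifier_apply(modifier="Union")

    # Step 7: export the merged solid as an STL file.
    bpy.ops.object.select_all(action='DESELECT')
    base.select_set(True)
    bpy.ops.export_mesh.stl(filepath="barcelona-airbnb3d.stl", use_selection=True)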

3. Create the GCode for the 3D printed hexagonal grid map layer

Software: Slic3r

Slic3r is a ‘slicer’ -- software that processes the 3D model and generates instructions for the 3D printer, for example by slicing the volume into thin layers, creating supports for overhanging elements, and so on. The version of Slic3r used in this tutorial is a custom version tailored to the 3D printer used, a Prusa i3 mk3, but you will find the same commands in any Slic3r version. If you are using a general version of Slic3r, then you will need to set the configurations appropriately for your specific 3D printer. It’s wise to talk to the owner/operator of the 3D printer in this phase.

Input: The .STL file containing the 3D model of the hexagonal maps, so ‘barcelona-airbnb3d.stl’.

Output: A .GCODE file that can be sent to the 3D printer. We’ll call it ‘barcelona-airbnb-3dprint.gcode’.

To create the .GCODE file (a command-line sketch follows these steps):

  1. Open Slic3r.
  2. Add the .STL file with the 3D hexagonal grid map, ‘barcelona-airbnb3d.stl’.
  3. Check if the print settings are correct (the default ones usually work fine) and slice the model.
  4. Export a .GCODE file of the sliced object. Let’s call it ‘barcelona-airbnb-3dprint.gcode’.
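
Slicing can also be driven from the command line, which is handy once you start producing maps for many cities. This sketch shells out to the Slic3r CLI from Python; the executable name and the ‘printer-profile.ini’ configuration file are assumptions that depend on your local install (export your working print settings from the Slic3r GUI first, and note that Prusa’s fork uses slightly different flags).

    import subprocess

    # Slice the 3D model into printer instructions using saved print settings.
    subprocess.run(
        [
            "slic3r",
            "barcelona-airbnb3d.stl",
            "--load", "printer-profile.ini",   # hypothetical exported settings file
            "--output", "barcelona-airbnb-3dprint.gcode",
        ],
        check=True,
    )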

4. 3D printing the map

Hardware: Prusa i3, or other 3D printer

For this part of the project, it is advisable to first contact the person who will be operating the machine to produce your output. This is because every machine has its own specs, capabilities, and possibly even software.

Input: the .GCODE file with the instructions for the 3D printer.

Output: a physical map of Airbnb entire apartments distribution in Barcelona.

To 3D print the map:

  1. Copy the .GCode file (‘barcelona-airbnb-3dprint.gcode’) onto an SD card/USB drive depending on what your 3D printer reads.
  2. Insert the SD card into the 3D printer.
  3. Prepare the PLA filament and insert the spool in its holder, if it is not already in place.
  4. Insert the lead of the filament into the hole on top of the extruder.
  5. Preheat the extruder and the building plate.
  6. When the extruder and the building plate reach their target temperatures, load the filament. Check that it flows out smoothly; otherwise, repeat steps 5-6.
  7. Select the .GCode file and print it. For reference, consider that it should take approximately an hour and a half to complete. You don’t need to sit next to your 3D printer the whole time, but it might be useful to check on the process every once in a while.
  8. When the piece is completely printed, delicately remove it from the building plate. The piece might be attached quite firmly, so you might need a spatula to aid you in the operation.
  9. Use some pliers to clean up the object by removing extra filament traces around the edges.
  10. PLA plastic is very easy to recycle: make sure you discard the scraps and waste appropriately, for example by bringing them to a PLA filament recycling facility or at least by putting them in a plastic recycling bin.

5. Assemble

You can lay out your 3D maps on a wooden support and, if you wish, add contextual information like titles and a 3D legend. You could write them in pen, paint them, or even try to 3D print them like we did. You can repeat this process to portray multiple cities.

In these tactile maps, data patterns are communicated through multiple senses, and both sight and touch work together to show how the two cities have similarities in terms of geographical distribution -- Airbnb listings are concentrated in the central areas -- and stark differences in terms of scale.

The center of Venice has a concentration of up to 444 listings inside the area of a single 500x500m hexagon, while in Barcelona the maximum is less than half that, at 207 listings. This disproportion is even more significant if we consider the population of the two cities: Venice is home to roughly 270,000 people (of which only 52,000 live in the central ‘fish’ that hosts the majority of Airbnb listings), while Barcelona is home to 1.7 million -- so Venice’s densest hexagon amounts to roughly one listing for every 117 residents of the historic centre, against about one per 8,200 residents city-wide in Barcelona’s case.

These types of tactile maps are a very basic data physicalisation type that can even be produced by beginners. At the same time, they are a novel and powerful way to present data beyond purely visual cues, allowing people to interact with your data and retrieve information from it in a physical setting.

Conclusion

As software and hardware progress, the technical barriers to making data physicalisations will lower, as they did for data visualisations. But, technology aside, data physicalisation is first of all a process that begins in the mind, by thinking creatively about what can be used to encode data beyond the visual and what types of experiences you want to foster in your community. And for this, you don’t need complex technical skills; you need imagination. So, I’ll leave you with this question: What data physicalisation do you imagine for your next story?

Evidence of a solution: using data to report more than just bad news https://datajournalism.com/read/longreads/evidence-of-a-solution Thu, 27 Jun 2019 01:00:00 +0200 Brent Walth https://datajournalism.com/read/longreads/evidence-of-a-solution One of the most dangerous things a woman can do is give birth. In recent years, health officials around the world have been working to reduce this age-old threat to mothers. And as Michael Ollove, a senior health reporter with Stateline revealed in late 2018, there’s good news.

“Over the past three decades,” Michael wrote, “the world has seen a steady decline in the number of women dying from childbirth.”

The bad news? “There has been a notable outlier,” he wrote. “The United States.”

Michael used data collected from the Centers for Disease Control and published by The Lancet, which show that maternal death rates have been falling around the world while climbing in the US. As Michael noted, that put the U.S. “in the unenviable company of Afghanistan, Lesotho and Swaziland as countries with rising rates.”

Michael’s troubling revelation in Stateline, a news service published by the Pew Charitable Trusts, later appeared in The Washington Post. His story examined the reasons for the rise in deaths—and could have gone further to point fingers and lay blame among government officials, health care providers, and insurance companies, all of whom might well be failing to act to reverse this awful trend.

The story might have stopped there, leaving readers with this grim news. Instead, Michael turned the spotlight on the state of California, a place that had seen a rise in maternal deaths but has since witnessed a steep decline. The state’s maternal death rate is now only a fraction of the rates across the rest of the US.

Importantly, Michael’s story showed why. He highlighted the strategies that Californian health officials took to uncover the reasons behind climbing death rates and pinpoint the specific causes, as well as the practical steps they took to fix problems in hospitals across the state.

What’s striking, as Michael’s story also showed, is that the fix for this tragic trend can be used elsewhere.

“This isn’t some weird California thing that can’t be replicated,” one leader in the fight to cut maternal death rates told Michael. “This is doable in other states. It’s a matter of having the will and the funding to get it off the ground.”

Michael’s story is an example of an increasingly important trend in enterprise reporting: solutions journalism -- storytelling that refuses to let problems lie, and instead examines possible ways to wrestle with meaningful issues in the community.

Solutions reporting identifies long-standing social issues and problems, and then tells the story of people who have demonstrated success in addressing them.

The idea of journalism that responds to the needs of the community has been around for a long time, but the attention to solutions journalism has gained momentum in the face of declining readership and repeated refrains that the news media dwell on negative news. Solutions journalism demonstrates that the news media see a broader role in investigating opportunities for change, reform, and hope.

On the face of it, this approach may sound like advocacy, and that’s created real concern and alarm that reportage promising to solve anything steps well beyond the boundaries of objectivity. To be sure, that risk is real, especially if news outlets push only happy news or manufacture stories to make local do-gooders look like heroes.

But solutions journalism, by definition, resists the lazy path. Instead, these stories avoid advocacy by relying on evidence and demanding the rigorous standards that journalists have always embraced.

And that creates enormous opportunities for data journalists. These stories need numbers and solid analysis that frames the problem, tests possible solutions, and demonstrates that the answers are there for everyone to pursue. In other words, the reporter applies journalistic standards to examine how people are working to address real-world problems, all in the hopes of telling a story that can make a difference.

In the end, the solutions journalist needs to go beyond 'So what?' The question they must also ask is, 'Now what?'.

The solutions approach

A primary role of journalism is to shine a light on what should matter in the community. In The Elements of Journalism, Bill Kovach and Tom Rosenstiel establish a bill of particulars when it comes to the essential mission of the news. Their charge that 'journalists must serve as an independent monitor of power' is a key tenet in their argument of journalism’s civic duty.

Bill and Tom also recognised the detrimental effect of a constant flow of negative news—a flow that could alienate readers and undermine the press’s central mission: “(T)he press should recognize where powerful institutions are working effectively, as well as where they are not. How can the press purport to monitor the powerful if it does not illustrate the successes as well as the failures? Endless criticisms lose meaning, and the public has no basis for judging good from bad.”

Jay Rosen, a renowned media critic and associate professor of journalism at New York University, has likewise long argued that journalism has a duty to speak to communities in a way that helps people address problems. As Rosen told the Knight Commission on Trust, Media and Democracy in 2018: “Your report is incomplete — lacks depth — unless it includes what we can do: as individuals, as a society or political community.”

In recent years, many big thinkers in news media have talked about this solutions approach as not just a duty, but as a means of survival in a business that is losing readers, viewers and, too often, a reliable business model.

“Solutions journalism,” Karen McIntyre, an assistant professor of multimedia journalism at Virginia Commonwealth University, wrote in 2017, “is intended to be a more productive style of reporting that might present at least a partial remedy to the increasingly apathetic and frustrated public that has resulted from the mainstream news industry’s conflict-based content.”

In Europe, the approach is often called constructive journalism; however, the two approaches align so closely that many observers see little difference between them. Sean Dagan Wood, publisher of Positive News, described the idea in 2014: “This is about bringing positive elements into conventional reporting, remaining dedicated to accuracy, truth, balance when necessary, and criticism, but reporting in a more engaging and empowering way.”

Studies have demonstrated the power of solutions journalism to engage. Compared with readers of a similar story that looked only at the problem side of an issue, online readers of solutions-based stories spent more time with the story and came away with more optimism. Readers also didn’t come away believing that journalists were straying from their primary mission by writing about solutions.

So, what are we talking about when we talk about solutions?

The Solutions Journalism Network is a non-profit that promotes this type of reporting and trains journalists in how to pursue it. The organisation has a 10-point test to see if a piece of reporting fits its definition of a solutions story.

I’ve sought to summarise those points here. A solutions journalism story must:

  • Show how people are responding to a consequential social problem with an approach that has proven evidence of success.
  • Tell the story of how the solution came together, show how the solution actually works, and be realistic about its limitations.
  • Bring to life the ways in which people on the ground have dealt with the problem, and don’t just rely on outside experts who lack hands-on experience.
  • Avoid labelling people heroes; instead, show “characters grappling with challenges, experimenting, succeeding, failing, learning. But the narrative is driven by the problem solving and the tension is located in the inherent difficulty in solving a problem.”

As one key Solutions Journalism Network standard puts it, “Does the story avoid reading like a puff piece?”

For it to be solutions journalism, the answer must be yes.

The Solutions Journalism Network.

The role of data

A compelling part of the definition of solutions journalism is the demand for evidence. The scale of the problem needs to be demonstrable, and the response to the problem needs to be measurable. And that’s where the data journalist can play an important role.

Solutions journalism can use data to identify the problems.

These stories, by their very nature, go beyond what information is handed to a reporter. Instead, the journalist goes in pursuit of stories about the work of addressing social problems.

The journalist pursuing solutions stories can also play another role: that of watchdog, seeking to hold people in power and authority accountable. A solutions story could go even further and move into the realm of investigative reporting, especially as defined by Investigative Reporters and Editors: “The reporting, through one’s own initiative and work product, of matters of importance to readers, viewers or listeners. In many cases, the subjects of the reporting wish the matters under scrutiny to remain undisclosed.”

So, it makes sense that investigative data reporting and solutions journalism could go hand in hand.

How often does this happen? Not nearly enough.

Research we’ve completed at the University of Oregon has shown that solutions journalism stories rarely uncover the problems they discuss. Instead, these stories focus on problems that are well established, widely accepted, or based on research and data published by others. Most often, reporters pursuing solutions stories take on problems that have already been framed by the people trying to address them.

In some ways, that makes sense: The better established the problem, the more likely there will be tested solutions about which a journalist can write.

But many enterprising reporters have shown how data analysis, investigative work, and solutions storytelling can work together.

A 2014 investigation by the Post and Courier of Charleston, South Carolina, showed how women in the state faced the nation’s highest risk of being assaulted or killed by men. The story was driven by data compiled from government records and analysed by the newspaper’s journalists.

But the story didn’t just point fingers. The series, which revealed failures in state laws and a lack of help for battered women, highlighted programs in other states that had proved they could curb this dangerous trend.

The Post and Courier clearly outlined solutions to problems in their Till death do us part investigation.

In 2016, Joan Garrett McClane and Joy Lukachick Smith of the Chattanooga Times Free Press published a series called The Poverty Puzzle that used data to examine chronic poverty in its community, and then looked for tested alternatives to help people break out of their economic cycles.

In both cases, these investigative series paired the power of data with solutions to deliver a more complete look at these problems. The Chattanooga series was a finalist for a Pulitzer Prize, and The Post and Courier won in the Pulitzer’s highest category, the gold medal for public service.

Data journalism can help show a solution is working

Many of the pitfalls faced by data journalists in other kinds of stories are at play in solutions journalism as well: Make sure your data are accurate, verified, and confirmed. Don’t jump to conclusions based on a first look at results. Understand and explain the potential weaknesses and limitations of the numbers. And always be transparent about the source of the numbers, including why and how they were collected.

Michael Ollove’s story about California’s maternal death rates, published by Stateline and the Washington Post, relied on widely available data from the Centers for Disease Control and Prevention’s WONDER database. To back up those findings, the story cited rankings performed by an independent non-profit that also concluded there was a remarkable difference between death rates in California and the rest of the US.

The numbers make a compelling case about California’s solutions. But the journalist took the extra step of making the potential limitations clear: California officials were not ready to claim their work was responsible for all of the improvements, and more research was required.
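For data journalists who want to run that kind of check themselves, the core arithmetic is simple. Below is a minimal sketch, assuming a hypothetical per-state CSV of maternal deaths and live births -- the file and column names are illustrative, not the layout of a CDC WONDER export -- that computes the conventional rate of deaths per 100,000 live births and compares California with the rest of the country.

    import pandas as pd

    # Hypothetical export: one row per state with columns state, deaths, live_births.
    df = pd.read_csv("maternal_deaths_by_state.csv")

    # Maternal mortality is conventionally expressed per 100,000 live births.
    df["rate"] = df["deaths"] / df["live_births"] * 100_000

    ca = df.loc[df["state"] == "California", "rate"].iloc[0]
    rest = df[df["state"] != "California"]
    rest_rate = rest["deaths"].sum() / rest["live_births"].sum() * 100_000

    print(f"California: {ca:.1f} deaths per 100,000 live births")
    print(f"Rest of US: {rest_rate:.1f} deaths per 100,000 live births")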

In Vancouver, Washington, school officials produced meaningful declines in chronic absenteeism among all students, but especially among those who live in poverty or who have no permanent address to call home.

The Seattle Times, which runs a solutions journalism-based project called Education Lab, in 2018 published a story that highlighted the improvements in Vancouver and demonstrated their effectiveness using data from both the local school district and the state. The Times went further by performing its own analysis to verify the results.

Their data helped demonstrate another key idea of solutions reporting: That the solution is portable, and that other cities have adopted similar programs.
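The verification arithmetic itself is not complicated. Here is a hedged sketch -- not the Times’ actual code -- using the widely used definition of chronic absenteeism as missing 10% or more of enrolled school days; the file and column names are invented.

    import pandas as pd

    # One row per student per school year: year, days_enrolled, days_absent.
    students = pd.read_csv("attendance.csv")

    # A student is chronically absent if they miss at least 10% of enrolled days.
    students["chronic"] = students["days_absent"] / students["days_enrolled"] >= 0.10

    # Share of chronically absent students by year; a declining series would
    # corroborate the district's claims.
    print((students.groupby("year")["chronic"].mean() * 100).round(1))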

The Seattle Times.

Data can help protect solutions journalism against claims of advocacy

The smart journalist is loath to get played by powerful institutions that want to push a happy, all-is-well message when the proof might show otherwise. Inside the newsroom, there’s often pressure on reporters to avoid appearing too soft or being duped by people who have easy and politically convenient answers.

In the classroom, I advise students interested in solutions reporting to frame their stories in ways that avoid accusations of political advocacy: Seek stories where the issues go beyond ideology. Focus the story on people instead of institutions. Accept that no story is truly objective, but that the journalist can rely on objective methods and fairness to relate what they have found.

Solutions journalism, through its reliance on evidence, can help move past this concern. The need for solutions storytelling to avoid being seen as advocacy rests to a great extent on the source, context, validity, and strength of the data.

In 2017, University of Oregon journalism students published a solutions story about the municipal court in the city of Eugene, Oregon, where homelessness has created a growing caseload of misdemeanors and violations.

The city used a $200,000 federal grant to create a second-chance program called Community Court, which allowed defendants facing low-level, non-violent crimes to have their charges cleared if they stayed out of trouble.

City officials claimed the program was working to reduce crime and caseloads in the courts, but they offered no evidence of their claims.

Two years later, student journalists returned to investigate the solution. They used data analysis to track 789 defendants who had been eligible for the community court, and their analysis showed that the program had made no difference in recidivism rates.
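A comparison like theirs can be expressed in a few lines. The sketch below is illustrative only -- the file, column names, and test are assumptions, not the students’ actual analysis -- but it shows the shape of the check: compare reoffence rates between program participants and other defendants, and ask whether any gap is statistically distinguishable from zero.

    import pandas as pd
    from scipy.stats import chi2_contingency

    # One row per defendant: in_program (bool), reoffended (bool).
    cases = pd.read_csv("defendants.csv")

    # Raw reoffence rate for each group (program vs. regular docket).
    print(cases.groupby("in_program")["reoffended"].mean())

    # Chi-squared test on the 2x2 table; a large p-value is consistent with
    # "the program made no difference in recidivism rates".
    table = pd.crosstab(cases["in_program"], cases["reoffended"])
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"p-value: {p:.3f}")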

Their story showed the claims of city officials were inaccurate and could spur more stories about what solutions might actually work to help the area’s homeless.

Eugene Weekly.

In 2016, ProPublica examined a system presented as a solution that could reduce jail overcrowding and curb racial disparities. In many cities, defendants and convicts are given risk-assessment scores: algorithm-generated numbers that are supposed to predict a person’s likelihood of committing future crimes. Judges can use this information to set prison sentences, and corrections officials can use the information to determine when to release inmates.

ProPublica’s investigation, Machine Bias, raised questions about the solution -- it found the scoring system was biased against African-Americans, inaccurately tagging black defendants as more dangerous than whites.
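ProPublica’s published methodology centred on comparing the algorithm’s error rates across racial groups. The sketch below is a minimal illustration of that idea, with invented file and column names: among defendants who did not go on to reoffend, how often did the tool nonetheless label them high risk?

    import pandas as pd

    # One row per defendant: race, high_risk (bool), reoffended (bool).
    df = pd.read_csv("risk_scores.csv")

    # False positive rate by group: of those who did NOT reoffend,
    # what share were scored high risk anyway?
    non_reoffenders = df[~df["reoffended"]]
    print(non_reoffenders.groupby("race")["high_risk"].mean())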

In the end, solutions stories can turn readers’ heads. The evidence -- and the data -- make those heads nod in understanding.

Designing data visualisations with empathy https://datajournalism.com/read/longreads/data-visualisations-with-empathy Tue, 11 Jun 2019 00:00:00 +0200 Kim Bui https://datajournalism.com/read/longreads/data-visualisations-with-empathy In 2015, Jacob Harris -- who was then working for the New York Times as a software architect -- wrote about different methods of empathy used in visuals, from using people as icons instead of dots to zooming in on a singular slice of a larger issue. He worried that standing too far from the humanity of the topic being visualised was a disservice to audiences.

“As data journalists, we often prefer the ‘20,000 foot view’, placing points on a map or trends on a chart. And so we often grapple with the problems such a perspective creates for us and our readers—and from a distance, it’s easy to forget the dots are people. If I lose sight of that while I am making the map, how can I expect my readers to see it in the final product?”

A journalist’s job is to ask questions of the world and to show the truth in their answers. In an ideal world, journalists reflect a community back to itself -- yet, with slow movement towards true diversity and inclusivity in newsrooms, it is harder for all communities to see themselves reflected in their local news coverage, much less in the charts and visualisations accompanying these stories. Empathy is one potential answer to bridging this gap between journalists and the communities they serve.

My prior research on empathy for the American Press Institute looked at this issue. The theory is that approaching stories -- and people -- with more empathy creates better relationships with marginalised communities, builds trust and increases diverse coverage. Empathy in this context is seeing a person’s actions and motivations from their point of view. It is not to be confused with sympathy, which is mirroring your own feelings with another’s to aid understanding.

P. Kim Bui is the director of audience innovation at the Arizona Republic and author of the American Press Institute report The empathetic newsroom: How journalists can better cover neglected communities.

But why empathy?

Empathy is one way for journalists to explain the world around them as it really is. While it may seem that objectivity would get one further, telling a story with empathy allows the audience to connect more closely with the human side of an issue.

People enjoy interacting with stories that push them forward and show them a different perspective. To truly understand another person’s perspective, journalists must learn to practice empathy.

“A journalist can understand what led a boy to become a violent gang member and convey that in a story. That doesn’t mean she makes excuses for his behavior. In fact, painting a fair, accurate picture of a life -- and doing so with empathy -- sometimes involves pointing out things that the subject prefers not to acknowledge.”

-- American Press Institute

There are three kinds of empathy, as I outlined in my research, and two of them are incredibly applicable to all kinds of journalists:

A reporter can employ cognitive empathy to approach an underserved community, using techniques that help them understand people with opposing views and from different backgrounds.

Reporters can also practice behavioural empathy by using verbal and nonverbal signals to show they’re working to understand another person’s feelings and ideas. These signals can be simple, like putting your pen down to let someone cry or looking into their eyes as they speak.

The third kind of empathy, affective empathy, is where one mirrors the feelings of another. This type makes many journalists uncomfortable. They believe sharing a source’s emotions is a sign they’ve gotten too close and jeopardised their impartiality.

But empathy can also go beyond the interviewing and reporting stage. In visualisations and data, journalists and technologists utilise these storytelling tools to put a different face to research and numbers. It’s about contextualising numbers, which are often seen as cold. Created with empathy, visualisations can provide both a close and a wide view of a particular issue or story within a community.

Reporting and storytelling methods like photos and videos, in a way, have an easier path to empathy. It is easy to try to understand another person’s motivations and perception of the world when you are reading their words, hearing them, and seeing them. What is harder is finding the people and shared experience within numbers and data. How do you get audience members, much less the journalists presenting the story to the audience, to walk a mile in the shoes of a dot? Or a bar chart?

It may take extra thinking, and more questions, but empathy can be achieved. Some journalists, like BuzzFeed’s Lam Thuy Vo, seek to humanise and find the surprising human connections within data. Lam treats data as another source -- not a pile of papers or a PDF. She is as curious and thorough with her processing of data as any reporter would be during an interview with a human source.

The humanity in data

Lam thinks of data the same way Kameelah Rasheed thinks of her art, she says. Most things tend to not follow a single line, and there are multiple ways of arriving at a story, depending on how you arrange things. In Guernica Magazine, Kameelah explains it this way:

“I reject the way that we have imagined the making of the archive as an administrative, objective, almost sterile process. I feel like archives are very dirty, very messy, really. Archiving is a subjective process; it’s a process that I hope engages and is relational and is not about someone sitting alone in an office. I have around four thousand found images of black families, and for me, making that archive is no different than creating an installation, because in both circumstances I’m collecting; I’m accumulating and I’m also trying to establish relationships between the things that I’m collecting. So in this archive of four thousand found images of black families, I’m making decisions. Do I categorize photos based upon the geographic location? Do I categorize photos based upon the year? Do I separate images of domestic settings from those of more public settings? The thought process that I go through when I’m archiving images is very similar to what I go through when I’m installing for shows: How do I organize what I’ve accumulated?”

Take categorisation, for example. Lam spent a lot of time thinking about which buckets we put people in as we explain data. People do not fit into boxes perfectly, so data can be just as malleable.

“Every categorisation is so flawed. What you need to do is understand its limitations and the scope of the categorisation and present that fully to the audience [in a way] that is clear and somewhat easy to understand without taking away the complexity. It's this bizarre dance.”

Lam spent a significant amount of time looking at what gentrification really means and how to define it when examining 311 calls in a particular New York City neighbourhood for BuzzFeed. It would be easy to conflate race with gentrification, assuming that a rising population of Caucasian residents means gentrification is happening. That assumption makes some sense given inequality in the United States, but gentrification also means rising home values and rising incomes -- and even those measures can mislead. A young person whose only income is from a trust fund, for example, skews incomes. She looked at other research and several models of defining gentrification before arriving at one used in a particular study.

When the story was published, Lam transparently laid out the methodology for defining gentrification this way: “There’s no universally accepted definition of gentrification, but BuzzFeed News used a methodology developed by Governing magazine, which in turn is similar to prior academic work. It uses Census data on income, home prices, and education -- but not racial or ethnic demographics.”
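To make that kind of methodology concrete, here is a hedged sketch of a Governing-style, two-step gentrification test at the census-tract level. The thresholds and column names are illustrative, not the published methodology verbatim: a tract must first have started out relatively low-income and low-value, and is then flagged only if its home values and education levels rose into the top third of all tracts’ growth.

    import pandas as pd

    # One row per tract: income_2000, home_value_2000, home_value_2015,
    # pct_bachelors_2000, pct_bachelors_2015.
    tracts = pd.read_csv("tracts.csv")

    # Step 1: only tracts that started in the bottom 40% are eligible.
    eligible = (
        (tracts["income_2000"] < tracts["income_2000"].quantile(0.40))
        & (tracts["home_value_2000"] < tracts["home_value_2000"].quantile(0.40))
    )

    # Step 2: among eligible tracts, flag those whose home-value and education
    # growth landed in the top third of all tracts.
    value_growth = tracts["home_value_2015"] / tracts["home_value_2000"] - 1
    edu_growth = tracts["pct_bachelors_2015"] - tracts["pct_bachelors_2000"]
    tracts["gentrified"] = (
        eligible
        & (value_growth > value_growth.quantile(2 / 3))
        & (edu_growth > edu_growth.quantile(2 / 3))
    )

    print(tracts["gentrified"].value_counts())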

How to employ empathy in data visualisation

Approach 1: understanding fuzzy data

Gentrification is not the only categorisation that is somewhat subjective. Many of the buckets we put data (and, ostensibly, people) into can be defined in different ways. For example, it took until 2000 for the US Census to include multiracial identities, many other countries still have not, and we continue to struggle with legal identifications for transgender or non-binary communities. These definitions may be more subjective than journalists often think they are. And categorisations might not tell the whole story. Think about a single father whose children are taken away from him: He might be classified as a neglectful parent, but that could fail to show the systemic problems (like poverty or racism) which may have contributed to the categorisation.

While data may seem cut and dried, people are not. The world is not. Everything is much more complex.

Thinking back on his time working on visualisations, and now watching the journalism world as a developer in civic tech, Jacob Harris reflected in an email interview on just how fuzzy data can be.

“We rarely get data that is exactly what we want to report, and so very often this analysis involves selecting a suitable proxy model that is close enough to what we want to measure and doing additional analysis on it. The problem is these proxies always are approximations of some sort with limitations and errors and often missing data (sometimes especially so for vulnerable populations), but putting things in a chart or as dots on a map can add an illusion of certainty and precision that isn’t there in the source data.”

Seeing these gaps and problems in your data is the first step. Representing limited or fuzzy data might mean a graphic isn’t a quick turnaround anymore, or that it requires more research. For some data journalists, it might mean going beyond percentages and using little people in visualisations.

While data may seem cut and dried, people are not.

Some questions for journalists to ask as they evaluate data:

  • Who is vulnerable in this story and how would they want to be counted?
  • What information would they need to improve their lives?
  • Who is undercounted or possibly missing entirely?
  • Who was counted? Who did the counting? Why were they asking these people?
  • Who benefits if you forget the dots are people?

Approach 2: the Quantified Selfie

There are times when looking at a single data point can help. Lam calls this an expansion of the idea of a ‘quantified selfie’. The concept of the quantified self comes from the mid-2000s, as use of personal tracking exploded. Fitbits, hydration counters, GPS trackers, and other methods of creating more and more data about ourselves became available. A quantified selfie takes that data and turns it back on ourselves (like a selfie does).

For Lam’s quantified selfies, she uses a single person’s experience as a prototypical experience for a trend. It’s a way of countering a lack of precise data and statistical relevancy -- a way to tell stories that are data driven and emblematic.

“We frontload our stories with that statement: It's not representative but it's emblematic. You can redo this, and we're happy to help.”

Screenshots from one of Lam’s Quantified Selfies, called Forget Me Nots, which explores the concept of relationships through the data in one person’s inbox.

Quantified selfies are particularly strong storytelling mechanisms when it comes to social media and allowing audience members to see themselves in the data. The technique might also have an app or quiz that lets people put their own data through the process used for the story, literally allowing themselves to empathise with another person’s experience.
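As a flavour of where a quantified selfie can start, here is a minimal sketch in the spirit of Forget Me Nots: counting messages per sender per year in your own inbox, using only the Python standard library. The mbox path is a placeholder; Google Takeout, for example, can produce such a file.

    import mailbox
    from collections import Counter
    from email.utils import parseaddr, parsedate_to_datetime

    counts = Counter()
    for msg in mailbox.mbox("inbox.mbox"):
        sender = parseaddr(msg.get("From", ""))[1]
        try:
            year = parsedate_to_datetime(msg["Date"]).year
        except (TypeError, ValueError):
            continue  # skip messages with missing or malformed dates
        counts[(sender, year)] += 1

    # The ten sender-year pairs you exchanged the most mail with.
    for (sender, year), n in counts.most_common(10):
        print(f"{year}  {sender}: {n} messages")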

Jacob provides an example of how quantified selfies and similar visualisations use “statistical methods and animation to provide personal perspectives of macro population trends that aren’t data about individuals per se”. He points to the New York Times’ popular income mobility visualisation as an example that uses empathy in data. The visualisation allows people to see how experiencing income inequality as a child affects the experience of income inequality as an adult. It also allows people to make their own comparisons using different races and income grouping.

The New York Times interactive tool lets you explore income inequality for virtually any combination of race, gender, income type and household income level.

Approach 3: showing the chart isn’t enough -- empathetic stories need people

In an era where graphics and visualisations no longer always live next to stories, journalists have to think about how each disparate piece of their work might be interpreted. It is easy to display empathy in a 2,000-word piece where a journalist has sought to understand another person’s motivations and how they came to be.

But a standalone graphic is a much more difficult thing. Wee people can distort the image of a person as much as they can help display it. In these cases, transparency can help remind the audience of what each dot or each tick means.

“The way in which we collect data, is kind of like a fly frozen in amber. It’s a direct reflection of the values in that point in time and what we deem is important.”

-- Lam Thuy Vo

“The best thing news orgs can often do in some circumstances is just to reinforce the human basis of most metrics we hear on a regular basis (like how is the unemployment rate defined, what does it mean that it ticked up a little, what is not being counted), but that only gives a limited amount of empathy,” Jacob said.

It’s also worth looking at whether representations of little people can be given more humanity. Ari Melenciano, a creative technologist and researcher, has been exploring empathy and its ties to graphics representing social justice as part of her 2018 Processing Foundation Fellowship. She was interested in how to make people care more about the social justice stories that data can tell, like about racial profiling and the high incarceration rate of minorities in the United States.

“I knew I wanted to explore different ways I could explicitly implement human aspects to the designs,” she wrote on Medium. “I felt that if my audience was able to associate a face or human element with designs that portray a social justice issue, the numerical facts would then be associated with an actual human life, and not just serve as a statistic.”

She looked at different ways of bringing a more human element into her visualisations, for example adding in celebrity faces rather than using small representations of people. Her concern was that in the end, even little people are just symbols and shapes that are there to represent human lives.

“Even the human symbol is an icon; maybe it needs a face on it, maybe it needs a different style of drawing,” she said.

She ended up with illustrative drawings that put faces on top of statistics, like how many black men go to prison, or how often black drivers get pulled over compared to white drivers. The drawings weren’t always precise, but as a standalone graphic, they told more of a story. Illustrations, she said, are a little less abstract than icons and people likely identify with them more.

Illustrations can be one tool for promoting empathy in data visualisation. Credit: Ari Melenciano.

She also examined her own bias and mission as she created each graphic.

“[I explored] finding ways to make sure I was as even as possible, to not look for things that were answering my question. Sometimes it’s not telling the story I want it to tell, so maybe not ignoring that.”

Ari is continuing to explore different ways of bringing humanity into visualisations, thinking about finding ways to allow people to dive deeper into a topic, or using sound as a way to bring people into a scenario.

In 2017, data artist Giorgia Lupi gave a TED talk that focused on the question of how small pieces of data might show empathy. She described a piece she worked on featuring Italian astronaut Samantha Cristoforetti. The app she built allowed people to say hello to Cristoforetti as she passed by on the International Space Station. Those ‘hellos’ were shown together, hundreds of individual people saying hello to a single person high above their heads.

Showcasing our own curiosity brought many different people together, Lupi said. “Data powered the experience, but stories of human beings were the drive.”

How data gets displayed and how it is powered are two ways of drawing lines between an individual and a data point. Perhaps, while narrative stories transport the audience into the shoes of another person, empathetic data journalism and visualisation brings a different person’s shoes to the audience.

Giorgia describes the interaction of understanding and empathy with data as a conversation: “I'm asking you to consider data -- all kinds of data -- as the beginning of the conversation and not the end.”

Conclusion

Finding empathy in data visualisations and presenting it to the audience requires as much forethought as it does in reporting, but the questions are often the same. Data is part of the larger conversation we have with our audiences. As much as the words we use and the questions we ask in reporting matter, so does the way we represent people and the categories we choose to put them in.


Data in the air: a guide to producing data journalism for radio https://datajournalism.com/read/longreads/data-in-the-air Tue, 28 May 2019 07:00:00 +0200 Michael Corey Adele Humbert Jacques Marcoux Sophie Chou, Petr Kočí, Paul McNally https://datajournalism.com/read/longreads/data-in-the-air Just because you can’t see something doesn’t mean it’s not there. While this statement is true for many things, there’s perhaps no better way to describe the emergence of data journalism in radio.

Since the field gained momentum in 2012, newsrooms around the globe have approached data with their eyes first -- adopting the latest and greatest visualisation tools and techniques, advocating for mobile-first and responsive design, and wow-ing readers with increasingly immersive data visualisations -- to the point that data journalism is often considered synonymous with data visualisation. And yet, data can power all forms of reporting, even those that aren’t visual.

Despite this oversight -- or perhaps because of it -- there is no shortage of excellence in radio data reporting. For years, Reveal has been producing investigative radio programming, often heavily derived from data. There’s also the BBC’s audiographs, which take statistics and turn them into sound to illustrate the headlines; an increasing adoption of sonification techniques from journalists across the board; and the fact that almost all journalism can leverage data in its underlying research phase.

In this Long Read, we’re stepping away from the old adage, where a picture is worth ‘a thousand words’, so that we can begin thinking about how many a sound is worth. To help you start listening to data, we picked the brains of six experts, with backgrounds in radio or audio storytelling, to uncover how these formats can be harnessed for data journalism.

Showing you how it’s done are:

  • Michael Corey
  • Adèle Humbert
  • Jacques Marcoux
  • Sophie Chou
  • Petr Kočí
  • Paul McNally

How are audio formats different?

Our discussion started with an acknowledgement that numbers simply don’t stick in audio the way they would online or in print. Why? Well, as Adèle Humbert pointed out, listeners are often engaging with radio while they’re doing something else -- housekeeping or driving are common examples. As a result, it can be risky to include too much data in an audio piece because these listeners may not be engaged at a technical cognitive level.

This means that “you don’t get to put as much ‘data’ in a radio story compared to print”, said Paul McNally, “so you have to be tactical in terms of how you script chunks of numbers into your piece and not lose your audience”. Avoid listing off statistics, for example -- depending on your listener’s surroundings, all these numbers could blend into one.

While this may mean that radio is a limiting medium for some data journalists, Sophie Chou believes that these limitations offer an opportunity for journalists to make their reporting more emotionally impactful.

“As a data reporter, it’s my top priority to make my stories accessible to all audiences, not just ‘numbers’ people. I think that the fact that you can’t pack too much numerical information into a radio segment (or podcast) actually challenges me to think creatively about how to put the human story at the front of my work.”

Putting human stories first

When we use data as the foundation for a story, rather than the story, our experts agreed that there really isn’t that much difference between radio and other formats.

“What has become abundantly clear to me is that listeners don’t care whatsoever about how the core of your story was derived; what they care about is what impact it has on them or someone else. So in that sense, the output of good data journalism should almost be void of any ‘data’. When you reframe it like that, you realise radio as a medium doesn’t really handicap your data work,” said Jacques Marcoux.

Data journalists tend to put so much value in their methodologies, coding, and analysis, but doing so can mean that they lose sight of what the average listener cares about. Instead, Jacques suggests considering whether ‘the data’ is the story, or whether it’s a conduit.

“I understand the urge to highlight the behind-the-scenes work, but at the end of the day, it’s the human element that keeps people tuned in...and it just so happens radio is probably the best medium to achieve this,” he said.

Michael Corey agreed, “it’s important not to separate data from the rest of what we do”. Regardless of medium, journalists should ask themselves about the main takeaway they need the audience to have. Is precision really important? Or are you trying to convey the shape of a dataset?

“Sometimes that story needs data and sometimes it doesn’t. But if data or explaining a technical concept is truly important, I’m a big believer in leaning in. I was on a panel with Madeleine Baran and Will Craft from APM Reports a few years ago, and I really liked their formulation that data should be a character in your story. And a good character is worth developing -- give them some air time and let your audience get to know them. I think too often we’re scared of numbers. The temptation is to sneak one statistic in there and then get out ASAP. But if you had a character with one quote in a longform story, chances are that character just needs to be cut. Same with data,” Michael said.

Adèle Humbert’s work on the Paradise Papers at Radio France provides a good example of applying this storytelling principle. After data mining for months, with over 13 million documents to analyse, her reporting was guided by one key question: ‘How can I tell stories from very technical data?’. In answering this question, she knew that the human stories and the people involved in the data needed to be at the heart of the audio stories. So, she produced a series of short audio stories, focussing on several main characters, with online articles to complement these and explain the underlying technical details.

In her reporting on the Paradise Papers, Adèle focussed on the human characters, such as Lewis Hamilton.

How to represent data through audio

Since radio doesn’t offer the ability to represent data visually, journalists have to be smart about how they represent data and translate complexities into digestible stories. Our experts laid out a few key principles to remember.

First of all, remember that “there is never enough time on-air to explain everything”, said Petr Kočí. Instead of rattling off statistics, simplify and report the most important trends, using illustrative examples.

“In our experience, audiences respond to things that are tangible, comprehensible, concrete, and especially rankings, e.g. these are the most dangerous intersections for pedestrians, these areas are most affected by drought.”

As an example of effective simplification, Jacques Marcoux shared the New York Times’ (NYT) piece Nine Rounds a Second: How the Las Vegas Gunman Outfitted a Rifle to Fire Faster. The project included a data visualisation, with sound beeps to compare gunfire between the Las Vegas shooting, where the gunman is said to have modified his weapons; a semiautomatic assault rifle; and an automatic weapon. Through these beeps, it’s clear that the Las Vegas gunshot timings are closer to those of a fully automatic weapon.

The NYT used a data visualisation with sonified data points to illustrate the speed of the Las Vegas shooter’s gunfire.

“While it wasn’t published on a traditional radio platform, it serves as a great example for using data to actually simplify (rather than complicate) the concept of a firearm’s rate of fire,” he said.
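The arithmetic behind such a rate-of-fire comparison is itself a nice example of simplification: a handful of shot timestamps, pulled from an audio waveform, reduce to a single intuitive number. The timestamps below are invented for illustration, not the NYT’s data.

    # Shot timestamps in seconds, e.g. marked by hand from an audio waveform.
    shots = [0.00, 0.11, 0.21, 0.32, 0.43, 0.53, 0.64]

    # Rate of fire = number of inter-shot intervals / elapsed time.
    rate = (len(shots) - 1) / (shots[-1] - shots[0])
    print(f"{rate:.1f} rounds per second")  # about 9.4 for this made-up burst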

Getting to more novel techniques, Petr Kočí raised the possibility of experimenting with sensor journalism. You could, for instance, measure the vital functions of an undertrained marathon runner and get a sports doctor to commentate live during the race.

Reveal’s SMS-based story supplements.

Another innovative method, suggested by Adèle Humbert, is using data as ‘extra content’ that listeners can access if they are interested in more information. For example, Reveal has been experimenting with technology that allows audiences to send a text for more data while they’re listening to the story.

The special case of sonification

And then there’s data sonification -- the process of mapping data to produce sounds. Think data visualisation, but for the ears. It is increasingly used by online journalists as a supplement to graphics and, of course, in radio it’s an interesting way to help listeners experience trends and patterns in a dataset.

Reveal is particularly renowned for its data sonifications, and its Oklahoma earthquake sonification is often singled out as an exemplar in the field. In this project, the team used sound to reveal the state’s extreme rise in seismic activity. Each earthquake was represented through a ‘plink’ noise, with low pitches and loud volume to indicate magnitude. The result is an eerie composition that leaves listeners struck by the extent of earthquakes that have hit the state.
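As a rough illustration of the technique -- emphatically not Reveal’s actual code -- the sketch below maps a handful of invented quakes onto an audio timeline, giving each one a short decaying sine ‘plink’ whose pitch drops and volume rises with magnitude, then writes the result to a WAV file.

    import numpy as np
    from scipy.io import wavfile

    RATE = 44100
    quakes = [(0, 2.8), (3, 3.1), (4, 4.5), (10, 5.6)]  # (day, magnitude), invented

    audio = np.zeros(int(RATE * 12 * 0.1) + RATE)  # 0.1 s of audio per day, plus a tail
    for day, mag in quakes:
        start = int(day * 0.1 * RATE)            # place the quake on the timeline
        t = np.arange(int(RATE * 0.25)) / RATE   # each plink lasts 250 ms
        freq = 880 - 80 * mag                    # bigger quake -> lower pitch
        vol = min(1.0, mag / 6.0)                # bigger quake -> louder
        plink = vol * np.sin(2 * np.pi * freq * t) * np.exp(-12 * t)
        audio[start:start + len(plink)] += plink

    # Normalise to 16-bit integers and write out.
    wavfile.write("quakes.wav", RATE, (audio / np.abs(audio).max() * 32767).astype(np.int16))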

Speculating about why this project was so effective, Jacques Marcoux got us thinking about the connection between sounds and listeners’ real-world sensory experiences.

“In the case of the NYT sonification on the rate of fire of automatic rifles vs. bump-stock rifles, we can all relate to the sound of gunfire. Reveal’s seismic activity in Oklahoma example is strong as well, because we can relate to the feeling of the ground rumbling beneath our feet, say when a train or heavy truck rolls by.”

While listeners may relate to the sensory experience of an earthquake, Michael Corey, who was behind the Oklahoma project, told us that this wasn’t a key consideration for his team.

“When we did our earthquake sonification I wasn’t really thinking of the notes themselves as being related to the sound of earthquakes. In fact I’ve seen another sonification that uses sped-up waveforms from real earthquakes to illustrate a similar phenomenon, and I thought that schtick got in the way a bit of understanding. I was focused on the individual sounds as opposed to the overall picture,” he said.

A snapshot of how Reveal coded Earthquake magnitude into sound.

“You can generally imagine in your head what a dataset might sound like, but really it comes down to listening to the result and playing it for other people who aren’t engrossed in it. They will tell you in about two seconds if it works or not. You should be able to explain in words what it will sound like and what the effect will be, but I wouldn’t veto a piece based on this -- it’s all in the listening. Our earthquake sonification almost never got off the ground because our executive producer -- sorry, Kevin, outing you -- wasn’t sold on the concept. But once we did a test, including a voiceover with the host, it was a no-brainer and he was convinced.”

Sophie Chou, whose gun violence project provides another great example of journalistic sonification, also highlighted the importance of audience feedback. Her project, which translates each mass shooting in America into a piano note, uses volume to provoke an emotional response from listeners. The louder the note, the more deaths.

Drawing on this experience, Sophie suggested testing your work in the newsroom or on a small audience. Did they understand what the sounds represent? Was the data being portrayed clear? Were they able to pick up the pattern or point of the sonification? If the answer to any of these questions is no, it’s probably best to simplify your sounds down.

“In both visuals and sound, I think that simpler is always better. The human brain is really good at picking up patterns and melodies, so I actually think it’s important not to tinker with the sound too much to sound melodic, or the listener might just catch up on a pattern and miss the data portrayed,” she said.

As evidence of this point, Michael Corey told us about an early concept for their border wall episode, which didn’t work so well.

“...we tried to do a sonification that showed people that ‘the wall’ was actually many different, disconnected sections of fence, with huge gaps in some places. We pulled off a pretty cool technical feat, I thought, in translating a shapefile into sound, but personally I don’t think the result was legible to the audience. The concept was that you were sort of flying over the border, starting in Tijuana and heading east. If there was fence below you, a melody was playing. No fence, no melody. There was one melody for tall pedestrian fence and another for shorter vehicle fence. We had a bass line below it to keep it moving because of the big gaps in some places,” he explained.

“In the end it was pretty intriguing musically -- our lead engineer and sonification co-conspirator Jim Briggs put a ton of energy into it -- but people internally had trouble following it. It just took too much time to explain the concept, and the pattern wasn’t immediately obvious. It sounded like the hodge-podge that the border wall is, so it was accurate in that way, but not entirely successful.”

His main takeaway: “if it takes you more than a few sentences to explain the concept, it’s probably too complex. And showing people disorder or the lack of a pattern is not going to be a very satisfying experience”.

So, how to create these simple (yet striking) sound patterns? Michael recommends time-series data as a good go-to starting point.

“Human ears are really good at discerning differences in loudness/pitch and in finding patterns, so time series data that’s cyclical can work really well. We’re so hard-wired to respond to music, and there’s a lot more I’d love to do to play with this concept -- hacking our musical brains. I think playing with time is one of the most effective tools -- dead air, pregnant pauses, rapid-fire delivery, etc. That lends itself to time-series data…”

Go forth and get producing

Now that you have an idea of how data journalism can be formulated for audio, it’s time to take that first leap into producing it. Whether you’re a data journalist eager to tell radio stories, or a radio journalist looking to add a data angle, our experts put together some simple tips to set you on the right path.

First, for all the radio-aspiring data journalists out there:

  • “Become friends with audio producers! The best projects come from collaboration. Make sure you know what kind of voice the show or podcast you’re working on is looking for. Put the human voice at the forefront of the story. Think about how your data can create a narrative. And when in doubt, simplify the amount of information you’re sharing.” - Sophie Chou
  • “Doing audio is a different set of muscles than we’re used to working, and you have to commit to working at it...But if you’ve never done audio before, you probably haven’t been cursed with News Anchor Voice, and natural-sounding speaking is the in-demand sound of the podcast era. Consider that a selling point! And in your writing for radio, show don’t tell, and free yourself from writing technically. Just tell a good story.” - Michael Corey

And for our radio-savvy, soon-to-be data journalists:

  • “Data is not a ‘decoration’ for your story. I strongly advise against adding data to a story that doesn’t need it. Data should drive your story idea and help shape it. Is there a beat or topic you cover that you might uncover more stories in if you could obtain certain documents or records? Start from there.” - Sophie Chou
  • “Everyone has learned that data is a good way to sell a story to your editors, but I am highly allergic to ‘sprinkling a little data on that’. Good storytelling answers questions, and I think data analysis should always start with a question. Not ‘what’s in this data’, but what do you really want to know? From there you can get help, teach yourself, or, usually, do both.” - Michael Corey

For both journalist-types, Petr Kočí offered some final advice: “you are two different species, but it's okay, be patient and keep talking to each other”.

Privacy and data leaks https://datajournalism.com/read/longreads/privacy-and-data-leaks Thu, 02 May 2019 00:00:00 +0200 Susan McGregor Alice Brennan https://datajournalism.com/read/longreads/privacy-and-data-leaks At its core, the essential work of journalism is to gather and verify non-public information, evaluate its potential value to the public, and then -- if that value is substantial enough -- organise and publish it in such a way that it helps people make informed decisions about their lives. In a very important sense, this means that reporting and publishing around leaked data is no different than any other reporting: Once verified, the question of what to publish, and how, is driven principally by how it can best serve the public good.

Yet both the scale and detail of today's information leaks, especially when combined with a highly networked -- and therefore global -- publishing environment, mean that modern leaks present substantive practical and ethical considerations for journalists. In addition to protecting the source of the leaked information, journalists -- like everyone else -- must work in a digital environment where their activities are almost constantly and ubiquitously tracked, making it all too easy to inadvertently reveal the direction of an ongoing investigation. Moreover, because leaks are now often larger than any one journalist -- or journalistic organisation -- can typically handle, they present unique collaboration and publication challenges, all of which must be carefully engineered to balance efficacy, transparency, and privacy.

Despite these complexities, there are a range of useful methods and heuristics that can help journalists make ethical decisions about using secret, sensitive, or personal data. In fact, the most basic -- and sometimes the most difficult -- challenge is for journalists to accurately recognise when an ethical situation exists, especially in the fast-paced world of online publishing. For example, both our research and our professional experience indicate that when journalists accurately perceive the sensitivity of the information they have obtained, they handle it with more caution and care. Yet our work also indicates that journalists typically rely on their sources' judgment when assessing the sensitivity of the information they are being given. This suggests that while sources can, in general, trust journalists to handle leaked data carefully, it may also lead to problematic oversights when journalists report on data that has been simply ‘dumped’ online, especially since the sensitivity and/or significance of that data may not be immediately obvious.

Wikileaks has faced criticism for its data dumps, which have at times left sensitive and personal data exposed.

In our view, the mechanism through which data is obtained does not change the ethical standards with which it should be treated. If anything, sensitive information made accessible by a hack or a leak deserves more careful handling, since the agendas of those who have made the information public are uncertain at best. Moreover, the smash-and-grab data collection methods that are typical of hacks and breaches virtually ensure that the information of many private individuals -- whose only failing, often, is having been in a database along with someone or something of interest -- will be swept up along with anything potentially newsworthy. As such, journalists need to take particular care that their work does not implicate or injure those who are only 'guilty by association’. It is because of this, in part, that ‘leaked’ data may in fact demand more thoughtful handling than material provided by a confidential source.

Hacked, breached, or leaked: key considerations when reporting with private or personal data

As we noted above, journalists can typically be relied upon to think carefully about how they handle sensitive data provided by a trusted or confidential source, in part because a human source will often impress upon the journalist the risks of mishandling the information. The journalist's desire to preserve valuable source relationships (and her own reputation) may also help counterbalance the impulse to publish material that may be more salacious than newsworthy. When data is simply 'dumped' online with little or no context, however, this source-oriented conscientiousness may go out the window. Combine this with the pressure journalists may feel from editors or competitors to get a story out quickly, and it can be hard to justify the delay required to weigh how significant the story in a leaked data set actually is.

To help illustrate how the lack of a ‘human’ source -- especially when coupled with the exciting and even illicit nature of leaked data -- can raise crucial ethical challenges, we'll look at the issues presented by the data made public through two prominent hacks: the Sony email hack and the Ashley Madison hack.

In late November 2014, a massive dump of internal data from Sony Pictures was posted on the data-sharing site Pastebin, following a months-long hack of the entertainment company's systems. The hacked data contained everything from email exchanges to personnel files, and the question of what motivated the attack -- as well as who had perpetrated it -- quickly dominated headlines across the news spectrum. Yet many of the articles examining the contents of the leaked data also leaned toward the tabloid, focusing on executives' nasty exchanges and celebrity name-calling.

A snapshot of the Guardian’s reporting on the Sony email hack.

Although many may find it hard to muster sympathy for these powerful and high-profile individuals, the coverage's focus on the machinations of Hollywood dealmaking and franchise evolutions also threatened to overshadow many of the substantive issues revealed by the documents, such as industry-wide coordination on lobbying efforts designed to reshape the way that online content is served, or the fact that the data revealed the social security numbers, home addresses, and salaries of tens of thousands of employees.

The overall emphasis of the Sony hack coverage, moreover, stands in stark contrast to coverage of similarly controversial leaks like the Snowden documents, where the focus has remained on the actions of powerful companies and nation-states, rather than on the foibles and gaffes of the individuals involved.

Less than a year after the Sony Pictures hack, another high-profile hack and data dump made particularly intimate details of a large number of people's lives essentially public online. In this case, however, the individuals whose information was posted were not celebrities or Hollywood dealmakers, but people from all walks of life who had joined an online dating site purportedly designed to facilitate extramarital affairs.

The Ashley Madison website offers a dating service for married individuals.

Although the Ashley Madison hack generally contained less personally identifiable information than the Sony Pictures hack, the ramifications of the breach for those affected were sometimes devastating: multiple suicides were attributed to the hack, with widespread blackmailing campaigns, lost political careers, marriages, and community relationships resulting from the fallout. While there was some substantive reporting to be done on aspects of the leaked data -- for example, on the potentially inappropriate use of the service from government offices, and assessment of the fraud claims that had been made by former users -- reporting that simply plucked individuals out of the Ashley Madison databases and treated them as 'sources' for additional comment may well have done more to traumatise people who had already been victimised by the breach in the first place. And while there is no doubt that accountability reporting can sometimes have negative consequences for those whom it covers, as journalists we must make every reasonable effort to ensure that those consequences are reserved for those legitimately suspected of actual wrongdoing, and not simply on people whose choices may differ from our own.

Inverting implicit biases

A key component of the ethical challenge presented by cases like those above stems from our need, as journalists, to confront our own implicit biases. Whether or not we find the subjects of leaked data likable or even sympathetic, we must carefully weigh the news value of reporting with leaked data against the privacy interests of the people we are reporting on. As we’ve discussed, this can be particularly difficult to do when the data in question is simply posted online, since these datasets lack a human source to remind us of the potential ramifications of publishing. Importantly, this loss of context often also obscures the motives of the people who obtained and/or posted the hacked data in the first place -- something that should arguably be a core focus of the reporting on it.

By their very nature, of course, our own biases are difficult to counter. Consulting with colleagues and editors is always a good place to start sanity-checking our first judgements. Another strategy is to use a simple thought experiment: If the contents of the leak were something that we personally believe should be private -- perhaps individuals' HIV status, for example -- would we still report on it, and how? Especially when we consider that most of us are unwilling subjects of data collection in the first place, it's important that journalists consider how to minimise additional harm to any individuals whose personal information they are working with -- no matter how it came into their hands.

At times, however, reporting with sensitive and/or personal information is unavoidable, and may be essential to an important piece of accountability journalism. Where that is the case, there are still a number of ways that journalists can do the reporting and publishing that they need to while minimising potential harm as much as possible. Though today's networked data environment means there are few guarantees to be had, a clearly defined and thoroughly considered process -- especially one conducted in consultation with experienced colleagues -- can help journalists be confident in the appropriateness of their reporting and publication choices when dealing with hacked, leaked, and otherwise sensitive data.

Considerations for reporting

Good security helps ensure privacy protections

While we are unabashed advocates for strong information security practices, a key reason for this is the protections that they allow journalists to provide for their human sources as well as any potentially sensitive data resources that they may have. Although many companies affected by data leaks and breaches have far more resources to dedicate to security, journalists have demonstrated a unique ability to protect information. The first step in treating personal data ethically is to take reasonable precautions to ensure that it does not slip out of your control. A great resource for beginning to enhance your own security know-how is the Electronic Frontier Foundation's Surveillance Self-Defense website, which has everything from a security ‘starter pack’ to guides on particular tools.

There are many resources available online for journalists to start building up their security capability.

Take care when verifying

While many appeared to relish The Intercept's apparent missteps when verifying documents allegedly provided to them by NSA contractor Reality Leigh Winner, it can be hard to know what information may tip off interested parties as you verify elements of a story.

If you are relying on web searches, consider using a location-masking browser like Tor to make it more difficult for online service providers to infer what you are researching. If you are relying on human sources, consider what you know about the provenance of the data you are dealing with. If it is genuine, ask yourself: What is the likely position of someone who would have access to it, and what might lead another person to guess where it came from? Use your answers to guide what information you share with whom when verifying.
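On the technical side, the Tor suggestion above extends to scripted research as well. A minimal sketch, assuming a local Tor client is running with its default SOCKS port (9050) and that the requests library is installed with its SOCKS extra (pip install requests[socks]):

    import requests

    # socks5h (rather than socks5) resolves DNS inside Tor as well.
    proxies = {
        "http": "socks5h://127.0.0.1:9050",
        "https": "socks5h://127.0.0.1:9050",
    }
    resp = requests.get("https://check.torproject.org/", proxies=proxies, timeout=30)
    print("Congratulations" in resp.text)  # the check page says so when Tor is in use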

No matter the circumstance, however, it is always wiser to avoid sharing original documents. Instead, retype segments of content (correcting obvious spelling and punctuation errors) that you need to reveal -- many organisations will distribute sensitive documents with unique typographical or formatting features, so that they can pinpoint the source of an internal leak if originals show up online.

To protect their source, Axios retyped the content of leaked White House schedules.

Finally, it's always a good idea to review documents (especially PDFs, .doc/x files and spreadsheets) on an old computer that will never be connected to the internet (use a thumb drive to move them there). While this may seem cumbersome at first, it also helps protect your other information (and your organisation) from the malware and viruses that leaked data may contain. This is especially true if you plan to print hard copies of documents -- the ‘enable editing’ permission required for printing can also put your own computer's data at greater risk.

Considerations for publishing

How much is enough?

Providing reasonable privacy protections can sometimes seem at odds with imperatives around both accountability and transparency. When it comes to publishing data, however, the choice is not ‘all or nothing’. In fact, digital publishing gives journalists a range of ways to strike a balance between protecting private data and being transparent about their work.

Naturally, any data to be published must first be verified; this alone will limit the incidental exposure of personal information, since verification -- in addition to being a hallmark of responsible journalism -- is incredibly time-consuming.

Once verified, there is a question of relevance: Is the personal information you plan to publish essential to the story, or not? This can be a difficult question to answer. But just as we should reflect on why we might include information in a story about someone's age, race, immigration status, or other demographic attributes, we should relevance-test personal information we intend to publish. In short: Does the story really need it? This is especially important to consider given the ripple-effect of revealing personal information about someone. Family members, work colleagues, and -- in this age of social media -- even casual acquaintances may be affected by the revelation of your subject's personal details. Moreover, since most journalism today is inevitably published into a global context, the norms to consider when publishing personal details are not confined to a single region or culture. In general, you should be as conservative as possible without sacrificing the integrity of the story. Only after doing a thorough risk-benefit analysis, which keeps the costs to the individual at the center of your reasoning, should you make a determination about what to publish.

Balancing privacy and transparency

Just as journalism seeks to hold power to account, journalists should make themselves accountable to the public as well. Where possible, this means sharing data, sharing code, and providing detailed methodologies for the stories that you produce.

Yet, in many cases, wholesale publication of data or documents may violate the privacy of innocent people, or even put them at risk. In these cases, there are a number of methods that journalists can use to make key information public.

Redaction

Tools like DocumentCloud allow journalists to both publish documents and retain fine-grained control over them, offering both redaction and annotation tools. Journalists hoping to redact information from documents are advised to do so with tools designed for the purpose, as some methods (for example, drawing black boxes in Adobe, or ‘hiding’ Excel columns or rows) are easily undone. It's also important to keep in mind that simple methods of de-identification (such as removing names and addresses) are often insufficient to protect people's identities, given how much information is available to cross-reference online.

An example of a redacted CIA document. Source: Wikimedia.

Is the personal information you plan to publish essential to the story, or not?

Samples and summaries

Another way that journalists can help protect individuals' privacy when publishing private data is to publish only a curated sample of data, or to publish data that has been summarised to the point that re-identification will be difficult, if not impossible. As above, however, the vast quantity of information available online means that both samples and summaries must be carefully designed to avoid revealing more than intended. We suggest consulting with a statistician if possible to help ensure that the measures taken are sufficient. A good place to start is to remove any obviously identifying combinations of information, and then to use the fully reported story to guide your thinking about what samples or summaries will further clarify that story without inadvertently exposing individuals' information.
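
To make this concrete, here is a minimal Python sketch (using the pandas library) of one common safeguard: publishing only group-level counts and suppressing any group so small that its members might be re-identified. The file and column names are hypothetical.

    import pandas as pd

    # Hypothetical source data with one row per individual
    df = pd.read_csv("incidents.csv")

    # Publish counts per area rather than individual records
    summary = df.groupby("neighbourhood").size().reset_index(name="count")

    # Suppress small groups, which are the easiest to re-identify
    MIN_GROUP_SIZE = 10
    summary.loc[summary["count"] < MIN_GROUP_SIZE, "count"] = None

    summary.to_csv("summary_for_publication.csv", index=False)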

Visualisation

Visualisation is itself often a way of aggregating data such that meaningful patterns are revealed in otherwise heterogeneous datasets. An advantage of visualisation for privacy protection is that only a very small subset of data features is typically needed to create a visualisation and, if well done, the visualisation adds real value to the story being told. Static graphics, in addition to being more platform-friendly, also naturally limit the amount of inference and post-publication manipulation that is possible.

When the Journal News published an interactive map of gun permit holders following a school shooting in 2012, many were outraged by the apparent privacy invasion.

In 2011, the Guardian was careful to publish maps about the UK Riots such that the underlying information was not accessible, out of concern for the type of backlash that the Journal News ultimately suffered. Both of these maps are now offline.

Maps published with the Fusion series Suspect City illustrate where and how different age groups were stopped by police in Miami Gardens, Florida. Because many of those stopped were vulnerable or underage, these visualisations show the patterns of stops without revealing other personal details.

Selective release

Sometimes a journalist may find themselves in possession of a unique trove of data that they simply do not have the resources to investigate as thoroughly as they would like. Happily, however, digital technologies remove the requirement to choose between a ‘data dump’ and keeping information entirely to yourself. For example, academic researchers are increasingly using simple contracts to support useful data sharing while also helping protect their data from misuse. Similarly, during the initial phase of the Panama Papers reporting, journalists working on the project were required to sign contracts about how they would handle both the data and their reporting processes -- an approach that succeeded in keeping the work safeguarded until the stories were ready to publish.

If there are reasons that data cannot simply be published, journalists can still indicate that they are open to data-sharing requests. While this barrier alone may be sufficient to deter many bad actors, requiring anyone wishing to use the data to sign a simple contract agreeing not to use or disclose it in certain ways offers an additional layer of protection, as they could be held legally liable if they fail to protect the data as agreed. Although this approach obviously involves some risk, it often provides a good balance between protecting sensitive or personal information and allowing responsible parties to hold journalists to account.

Conclusion

As the breadth and complexity of the broader data environment continues to grow, so, too, do the ethical challenges around reporting and publishing with such data. While the core considerations of news value and the public interest help to answer questions about what journalists should cover and how, the nature and scale of digital leaks and digital publishing have introduced new ethical issues that often need to be examined. However, just as with many other processes in digital journalism -- like verification -- creating a thoughtful, well-defined process for evaluating leaked data and deciding how it will be handled goes a long way to ensuring that your reporting efforts are not only efficient, but ethically sound.

For more on privacy and data journalism:

]]>
Spreadsheets for journalism https://datajournalism.com/read/longreads/spreadsheets-for-journalism Thu, 04 Apr 2019 17:29:00 +0200 Brant Houston https://datajournalism.com/read/longreads/spreadsheets-for-journalism It is still the easiest laugh to get from a group of journalists -- professionals or students -- throughout the world. All that needs to be said is: “We all know you got into journalism to do math”.

The laughs come because most journalists have seen themselves primarily as storytellers and word artists. For them, numbers are worrisome, boring, or interfere with the flow of an article.

Furthermore, the perception that journalists are inadequate and unable to deal with numbers has been reinforced by mathematicians and statisticians over the decades. They’ve pored through newspapers and websites, listened to radio, and watched broadcast news with the purpose of finding errors and ignorance whenever journalists have reported on numbers.

The book A Mathematician Reads the Newspaper, by John Allen Paulos, was relentless in its pursuit of journalists’ mathematical misfortunes. It followed his previous book, Innumeracy, which was broader in its criticism of math impairment across many professions, but which counted journalists among the wrongheaded.

Paulos’ books and other statisticians’ criticism implied that journalists hate numbers and can’t do math and, perhaps, never will. But that has become untrue over the past two decades. This change can be linked to the surge of journalists using data and to the self-realisation that they often do some kind of math every day. Whether it is deciphering budgets, examining salaries, or looking at accident or murder rates, most journalists these days are constantly counting and comparing numbers.

Certainly, up until the 1980s, it was the rare journalist who understood the difference between mean and median, could calculate a percentage difference, or do a simple rate. At the Kansas City Star, where I worked in the 1980s, there was a copy editor who knew how to do percentage difference by paper and pencil and he sometimes had a small line of reporters at his desk waiting for him to do that calculation for each of their stories.

Brant Houston is an award-winning journalist, journalism professor and author. He was an investigative reporter in US daily newsrooms for 17 years and was executive director of Investigative Reporters and Editors for more than a decade.

A major example of innumeracy over the years was that news stories would favour sports team owners -- without the reporters realising it -- during labour negotiations between owners and players. These stories would cite the average salary of players rather than the median, thus letting the huge salaries of a few star players inflate the average. If the reporters had used the median, they would have seen how few players earned the average, and that the perception of all players being millionaires was false.

In other instances, journalists would report that it was fair for workers to get the same percentage increase in wages, without realising that a 3% increase for someone making $150,000 (it’s $4,500) is much greater than a 3% increase for someone making $30,000 (it’s $900). Journalists would also fail to use rates for putting raw numbers in perspective. One city would be called the murder capital of a country based on the total number of murders, despite having a much lower murder rate than other cities. A road intersection would be deemed the most perilous based on total number of collisions, rather than the rate of collisions compared to traffic.

Always check the mathematics behind ‘murder capital’ claims -- sometimes it isn’t as robust as it seems.

An intersection that has a hundred collisions a year, when the traffic through it is 100,000 cars a year, is far less risky than an intersection that has a hundred collisions a year with only 10,000 cars passing through it in the same year.

But the public shaming of journalists who made mathematical errors left reporters, as the long-time journalist and top data journalism instructor Sarah Cohen wrote in her book, Numbers in the Newsroom, with "the impression we can't use any numbers without fearing retribution". (Cohen's book is an invaluable guide on journalism and math.)

Yet it was in the late 1980s that a small band of journalists began to embrace the power of accurate numbers and calculations as they began to work with data. They also discovered the spreadsheet. And, inspired by Philip Meyer's book, Precision Journalism, they came to see the power of math and numbers, rather than scorning or avoiding them.

Data journalism workshops given by Meyer at the University of North Carolina and by Investigative Reporters and Editors with its companion organisation, NICAR, drew hundreds of journalists eager to learn data analysis. At those workshops and then at NICAR conferences, they received training that included math -- training that was seldom, if ever, offered in classrooms for journalists or newsrooms. In fact, journalism professors wanting to keep up with the profession attended those workshops and became some of the few to include math and numbers in their classes.

In those workshops, my colleagues and I found that previous teaching of math had lacked the appropriate approach and perspective. The best approach demystifies ‘math’ and focuses on the basics that allow journalists to apply math in a practical way -- that is, to summarise numbers, put them in context, and determine whether the numbers are misleading or outright lies.

The result of the workshops -- which spread globally -- was an increased understanding of numbers and thus the ability to write more lucidly about those numbers. Numbers were not boring if they revealed shocking ethnic disparities, large numbers of failing bridges, or alarming rates of murder.

It was clear it was much easier to deal with numbers if the teaching led to that immediate illumination about a topic.

A screenshot of Visicalc -- the first spreadsheet that combined all essential features of modern spreadsheet applications. Credit: Wikimedia.

In addition, the use of spreadsheets, be it Microsoft Excel or Google Sheets, made the math easier because journalists could rely on automatic calculations once the numbers were entered in. That also increased journalists’ confidence in interpreting statistics and surveys, as well as encouraging them to employ more advanced statistical methods.

One manifestation of this change in journalism can be seen in the number of websites and news columns devoted to the interpretation of numbers. Among the places devoted to numbers: a regular Saturday column by Jo Craven McGinty on numbers in the Wall Street Journal, the Upshot column in the New York Times, and the FiveThirtyEight website by Nate Silver.

Another manifestation is the Philip Meyer Awards, international awards given by Investigative Reporters and Editors, that recognise the best uses of social science in journalism. Year after year, since 2005, these awards show the progress that has been made in the field's numeracy. For example, an investigation by Bayerischer Rundfunk and Der Spiegel, No Place for Foreigners. Why Hanna is invited to view the apartment and Ismail is not, revealed discrimination against foreigners in the German housing market through a large-scale survey of landlords. They found that potential renters with Arab and Turkish names were frequently ignored.

This investigation drew heavily on number-based experiments, like the difference in chances pictured above. Read the full piece here.

In another award winner, Buzzfeed and BBC used a million simulations of tennis matches to discover suspicious patterns in the shifting of betting odds and players who lost matches they statistically should not have lost.

In the US, journalists at several newsrooms have shown widespread cheating on standardised tests by demonstrating that test scores were implausibly high compared with previous years' scores. In a similarly math-based investigation, ProPublica uncovered a disturbing trend: temporary workers are injured at up to six times the rate of permanent employees, and their injuries are more severe.

And there are many examples of smaller but effective stories using numbers. Years ago, a reporter, who had just received training in spreadsheets, found the city she covered had uniformly miscalculated percentage changes in its annual budget. Some reporters found political associates in governments receiving much larger salaries than regular employees. Others calculated serious cost overruns in government programmes.

The spreadsheet as the basic, starter tool

With just a spreadsheet, a journalist can let the software do the counting and calculating, allowing them to concentrate on the purpose and result of their inquiry. It also opens the door to understanding more advanced statistics, and the use or misuse of statistics by governments and businesses.

The mathematical tools in a spreadsheet can be divided into two groups: data management and calculations.

Data management, in which the counting is automatically completed within the spreadsheet, includes:

  • Filtering data based on criteria
  • Sorting to bring meaning to numbers by looking at them from high to low or low to high
  • Summarising by grouping topics into categories, and summing or counting the numbers associated with each category

Important basic calculations, some of which can be executed automatically and some of which must be performed by the journalist, include:

  • Summing up a column or row of numbers
  • Determining the mean or median of a column
  • Calculating percentage difference
  • Calculating a rate
  • Calculating a ratio

Data management

Let’s begin with filtering. There’s a recreational boating accident database in the US that has details of accidents that led to deaths. Here is a sample of that data; similar data is probably collected in many other countries.

An abbreviated version of the recreational boating accident database kept by the US Coast Guard.

By using the filter function in a spreadsheet, a journalist can quickly answer the following question: How many persons died of drowning who were not wearing a life jacket (PFD -- Personal Flotation Device) and could not swim? It turns out that nearly two-thirds of the drowning deaths involved people who could not swim and did not wear a life jacket. A slice of the data appears below.

This shows only the accidental boating deaths caused by drowning and in which the victim was not wearing a life jacket and could not swim.

All the journalist has to do is click the filter icon, pick one of the scroll arrows in a column, and choose the criteria. The journalistic value of using this tool is immediately clear, because the reporter now has a story showing that some of the deaths may have been preventable.
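
For journalists who prefer scripting to clicking, the same filter can be reproduced in a few lines of Python with the pandas library. This is only a sketch: the file and column names below are hypothetical stand-ins for the real Coast Guard headers.

    import pandas as pd

    df = pd.read_csv("boating_accidents.csv")  # hypothetical export of the database

    # Keep only drowning deaths where no life jacket was worn
    # and the victim could not swim
    subset = df[(df["cause_of_death"] == "Drowning") &
                (df["pfd_worn"] == "No") &
                (df["could_swim"] == "No")]

    print(len(subset), "of", len(df), "deaths match the criteria")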

Simply sorting numbers can bring meaning to them, or it can take the political spin off them.

For example, the World Health Organisation issues an annual report on the healthy life expectancy of males and females in each country. The annual report is issued with the countries listed alphabetically. (Below is a simplified version of the data created by eliminating some of the columns of information.)

An abbreviated version of the Healthy Life Expectancy database from the World Health Organization.

Sort the countries from the highest life expectancy to the lowest and you can see the biggest differences -- potentially the start of a story on why some countries rank higher than others. This is done with a simple calculation: subtracting the life expectancy of males from that of females, and then sorting by that difference.

The Healthy Life Expectancy data with the calculated difference between male and female ages, sorted from the largest difference to the smallest.

As you can see, many of the largest differences are in countries that were part of the former Soviet Union. Again, this could be the start of an illuminating story on why that is.
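
The same subtract-and-sort step looks like this in Python with pandas, assuming the WHO table has been exported to a CSV file with (hypothetical) ‘male’ and ‘female’ columns:

    import pandas as pd

    df = pd.read_csv("healthy_life_expectancy.csv")  # hypothetical export

    # Subtract male life expectancy from female, then sort by the gap
    df["difference"] = df["female"] - df["male"]
    print(df.sort_values("difference", ascending=False).head(10))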

Grouping numbers in categories and counting or summing them (or both) can give a valuable overview of a dataset. A spreadsheet has an excellent tool for summarising, called a Pivot Table. Let’s have a look at how this tool can help discover which retailer sells the most guns in Missouri.

By clicking on the Insert tab and then on the icon for the Pivot Table, journalists can choose to count the number of licenses each business holds.

The Pivot Table icon has been selected in the left hand corner.

The Pivot Table allows you to count the number of each business with licenses by choosing from a list in a selection screen.

This pivot table shows the number of dealerships licensed under a unique business name.

Sorting from high to low based on number of licenses, it’s possible to see that the corporation Walmart has the most licenses to sell guns in Missouri.

This data shows the number of licensed dealerships by unique business name sorted by largest number to smallest.
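
Outside a spreadsheet, the summarising that a Pivot Table performs amounts to a one-line group-and-count in pandas; again, the file and column names here are hypothetical:

    import pandas as pd

    df = pd.read_csv("missouri_gun_licenses.csv")  # hypothetical export

    # Count licenses per business name, sorted from largest to smallest,
    # just as the Pivot Table does
    counts = df["business_name"].value_counts()
    print(counts.head(10))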

In these examples of data management, the journalist only has to do one calculation: subtraction (in the healthy life expectancy dataset, where male ages are subtracted from female ages). The software does all the other counting and arranging.

Calculations

Journalists can rapidly total columns of numbers by using the formula or icon for summing a column.

The icon in a spreadsheet is one way to do a sum, but if there are blank rows it is better to put in the specific range of numbers. Here is a list of salaries at an imaginary government agency.

The icon can be used for one group, but because of the blank row it will stop at that group unless the range is dragged upwards. It is easier to type the calculation =SUM(b2:b9) than to risk missing a row when specifying the range.

This is a fictional dataset on municipal salaries earned by political appointees with the total salaries added for the previous year.

The brilliance of a spreadsheet is that it maps the data, which allows formulas to be calculated and copied easily. Instead of doing calculations with raw numbers, journalists can use the ‘addresses’ of the numbers.

The mean is often known as the average and, in fact, spreadsheets use the word average for the calculation. But be wary: means can obscure the effect of a large number on the average -- such as a CEO or a team’s superstar -- or of a small number -- such as a group of lowly paid workers. A median, in which half the numbers are higher and half are lower, can serve as a lie detector and can correct for those ‘outliers’.

For example, a team of five athletes has one star and four regular players.

If the average is calculated with the formula =AVERAGE(b3:b7), then the average salary is $158,000, thus making it appear that most players are making $158,000.

A fictional dataset of the game salaries for professional basketball players.

However, the median with the formula =MEDIAN(b3:b7) shows that the median salary is $50,000, which is a much more accurate indication of what most of the players are making. By reporting only the average, a journalist would mislead the audience into thinking players are making much more than they are.
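
The same lie-detector effect can be seen in a few lines of plain Python, with figures chosen to reproduce the fictional team above (four regular players and one star):

    salaries = [40_000, 45_000, 50_000, 55_000, 600_000]  # four regulars, one star

    mean = sum(salaries) / len(salaries)
    median = sorted(salaries)[len(salaries) // 2]  # the middle of five values

    print(mean)    # 158000.0 -- inflated by the star's salary
    print(median)  # 50000 -- closer to what most players earn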

Calculating a percentage difference is one of the most powerful tools a journalist can use. It puts numbers in proportion. For example, a journalist might want to look at the impact of salary raises on individuals at an agency. In the two columns in the agency worksheet, last year’s wages and this year’s wages are listed. As seen below, the percentage difference is not calculated by typing in the raw numbers -- $7,000 (the difference) divided by the previous salary ($45,000) -- but by entering the formula =D2/B2, which uses the cells’ addresses instead. (The = sign is needed for any formula.)

So, to calculate a percentage difference, last year’s wage is subtracted from this year’s wage. Then the difference is divided by last year’s wage. With this calculation, the actual impact on each worker is seen. These are not the usual raises, of course, but fictional ones given to a politician’s associates.

This dataset shows the percentage difference calculated between last year salaries and this year’s salaries.

Percentage difference is used in many reports -- budgets and trade, for example -- to show both raw numbers and how they compare to each other.
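
As a worked version of that raise example, expressed in Python:

    last_year = 45_000
    this_year = 52_000

    # Percentage difference: (new - old) divided by old
    pct_change = (this_year - last_year) / last_year * 100
    print(round(pct_change, 1))  # 15.6, i.e. a 15.6% raise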

Rates are used throughout the world, whether they are for traffic accidents, mortality, crime, or many other issues. Rates are used so that fairer comparisons can be made between categories, often addressing risk. For example, one city could have 600 murders a year and another could have 400 murders a year. But if the population of the city with 600 murders a year is much larger, then its murder rate is much lower and, thus, the risk of being murdered there is much lower. (Crime rates can be more complex than this, but this is a frequent use of rates.)

Actual data from the Federal Bureau of Investigation with the murder rate calculated for the larger cities in the US.

A rate is calculated by thinking of the number of incidents per population (which could mean people, or the number of vehicles if it’s traffic, and so on). In the case of murder rates, it would be the number of murders divided by the city population. In this example, the US city with the most murders (Chicago) does not have the highest murder rate.
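
A quick sketch of the arithmetic, using illustrative figures and the per-100,000-residents convention common in crime statistics:

    murders = 600
    population = 2_700_000  # illustrative figures

    rate_per_100k = murders / population * 100_000
    print(round(rate_per_100k, 1))  # about 22.2 murders per 100,000 residents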

Ratios are extremely useful when writing about numbers. It can be much more concise to write that one number is double another, rather than that it is 100% higher. It can be quite startling to find that one group of people is jailed twice as often as another, or that a pharmaceutical drug has a success rate three times higher than another’s.

For example, one ethnic group has 8,000 persons jailed each year. Another group has 4,000 persons jailed each year. By using the formula =8000/4000, the ratio is determined to be 2 to 1, or double. If the first ethnic group makes up only 10% of the total population, then a journalist has the beginning of an inquiry into why.
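
And the same ratio calculation written out in Python:

    group_a_jailed = 8_000
    group_b_jailed = 4_000

    ratio = group_a_jailed / group_b_jailed
    print(ratio)  # 2.0 -- a 2-to-1 ratio, i.e. double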

For more on using spreadsheets, check out our video courses Doing Journalism with Data: First Steps, Skills and Tools and Cleaning Data in Excel. You can also start a conversation in our forums.

Conclusion

These basic functions and calculations allow journalists to overcome a fear of numbers and to leap into using math for stories. If the growth of data journalism is anything to go by, the adoption of these new skills benefits both the field and newsrooms, with inquiries that are more accurate, use better comparisons, and give greater context. It all adds up to what every reporter strives for: meaningful and insightful journalism. And that is no laughing matter.

]]>
Putting data back into context https://datajournalism.com/read/longreads/putting-data-back-into-context Thu, 04 Apr 2019 17:02:00 +0200 Catherine D'Ignazio https://datajournalism.com/read/longreads/putting-data-back-into-context What happens when an institution collects data about something in the world? The word data comes from a Latin root meaning 'that which is given'. And this is typically how newcomers regard data -- as a somewhat neutral recording of facts at hand. The information was there, and then an institution collected and stored it. When data journalists investigate an issue, we look for who might have data, how we can acquire those data, and then use them to create new insights into the world.

But the scholar Johanna Drucker proposes a different word for data: capta. By this, she means, 'that which is taken'. As Johanna states in her paper, Graphesis: Visual knowledge production and representation, "Data are considered objective 'information' while capta is information that is captured because it conforms to the rules and hypothesis set for the experiment".

This distinction might seem academic for data journalists, but in fact it's at the root of why context matters so deeply for data journalism. Thinking of data as capta invites us to consider why an institution invested its resources in collecting information, how the institution uses that information, who the information benefits (and who it doesn't), and what the potential limitations of the information are. In short, it points us back to how data are never neutral 'givens', but always situated in a particular context, collected for a particular reason. In Data Feminism, the book Lauren Klein and I wrote together, we devote an entire chapter to the importance of considering context, particularly when the collection environment has any kind of power imbalance.

Catherine D'Ignazio is an Assistant Professor of Data Visualization & Civic Media at Emerson College and a research affiliate at the MIT Center for Civic Media and the MIT Media Lab.

Why context is hard

Establishing and understanding the context of your data (capta) is likely one of the single most challenging aspects of doing data journalism. It's like starting out with the leaves of a tree and then trying to connect them back to their branches and roots. But why is context so hard?

First of all, data are typically collected by institutions for internal purposes and are not intended to be used by others. As veteran data reporter Tim Henderson, quoting Drew Sullivan, said to the NICAR community, "Data exists to serve the bureaucracy, not the journalist". The naming, structure and organisation of most datasets are done from the perspective of the institution, not from the perspective of a journalist looking for a story. For example, one semester my students spent several weeks trying to figure out the difference between the columns 'PROD.WASTE(8.1_THRU_8.7)' and '8.8_ONE-TIME_RELEASE' in a dataset tracking the release of toxic chemicals into the environment by certain corporations. This is not an uncommon occurrence!

And while the open data movement has led to governments launching more open data portals and APIs, these efforts have prioritised publishing data over publishing the metadata that would actually make the data more useful to outsiders. Part of this is cost-related -- context is expensive. The cities of Boston and Chicago both had to secure external grants from foundations in order to embark on comprehensive metadata projects to annotate the columns of the open datasets on their portals and make their datasets easier to search and find.

But sometimes, the lack of attention to usability, context and metadata works in favour of the collecting institution, which may have reasons for why it doesn't want certain information to become public. For example, the Boston Police Department (BPD) runs a programme called Field Interrogation and Observation (FIO). For all intents and purposes, this is a stop and frisk programme, in which police log their encounters -- observations, stops, frisks, interrogations -- with private individuals on the streets. In 2014, following a lawsuit won by the American Civil Liberties Union, the BPD was obligated to release their FIO data publicly on Boston's data portal. But when you search for ‘stop frisk’ on the portal, nothing comes up. Journalists and members of the public would need to know the bureaucratic term for the programme (FIO) in order to be able to locate it on the portal.

Searching for ‘stop frisk’ in the City of Boston's open data portal yields no results. Those searching for data would have to know that the programme is called ‘Field Interrogation and Observation’.

Furthermore, some institutions may publish their data and metadata freely, but be less forthcoming about their data's limitations. This can lead to serious cases of misuse and misrepresentation of data. In one chapter of Data Feminism, The Numbers Don't Speak for Themselves, Lauren Klein and I discuss the case of GDELT: the Global Database of Events, Language and Tone. In a high-profile correction, FiveThirtyEight had to retract a story about the kidnappings of Nigerian girls that used the GDELT database. They had mistakenly used media reports about kidnappings to tell a story about the incidence of kidnappings. While FiveThirtyEight should have verified their data before publishing, we describe how GDELT, under pressure to attract big data research funding, failed to describe the limitations of its ‘events’ data (which is not events data at all, but rather ‘media reports about events’ data).

The Three-Step Context Detective

So, what's a data journalist to do? She has to become a ‘context detective’, working with data that have been captured from the world into spreadsheets and databases, and connecting them back to their collection environment. This work is similar to that of a detective -- the journalist has to use incomplete clues that point backwards to the bureaucratic workings of the collecting institution.

To understand the data, you must understand the who, what, when, where, why, and how of that bureaucracy. Below I present a process called the Three-Step Context Detective, which I use in the classroom. These steps don't necessarily have to be completed in this order.

1. Download the data and get oriented

Looking at hundreds, thousands or hundreds of thousands of obliquely named rows and columns can be daunting at first. Sometimes newcomers think that the data science ‘wizards’ can just look at a spreadsheet and see a story emerge. Nothing could be further from the truth. Getting oriented in your dataset involves breaking down the basics systematically so that you can ask good questions.

You can use a programme like Excel or Google Sheets to do basic exploration to answer questions like the following (a scripted version appears after the list):

  • How many observations (rows of data) do you have?
  • How many fields (columns) do you have?
  • Is it clear what each row is counting? (Remember those incidences of kidnapping versus media reports about kidnapping – getting crystal clear about what your data is logging is supremely important.)
  • What is the time period of the data? Use the ‘Sort’ function on any column with dates or timestamps to see when the data begin and when they end.
  • What is the geographic extent of the data?
  • Does there appear to be a lot of missing data?
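
A minimal Python sketch of that orientation pass, assuming the dataset has been saved as a CSV file and has a date column (both names here are hypothetical):

    import pandas as pd

    df = pd.read_csv("dataset.csv")  # whatever file you are orienting yourself in

    print(df.shape)         # how many observations (rows) and fields (columns)
    print(df.dtypes)        # what each column appears to contain
    print(df.isna().sum())  # missing values per column

    # Time period: parse the date column first so sorting is reliable
    dates = pd.to_datetime(df["date"], errors="coerce")
    print(dates.min(), dates.max())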

For more on using spreadsheets, check out Brant Houston's article Spreadsheets for journalism, or our video courses Doing Journalism with Data: First Steps, Skills and Tools and Cleaning Data in Excel. You can also start a conversation in our forums.

This process of getting oriented with a new dataset is no small task. In fact, Rahul Bhargava and I built a free, online tool called WTFcsv which captures the emotion that journalists often feel when looking at a new spreadsheet: "WTF is going on with my csv file?!" Using WTFcsv, you can continue your orientation process.

WTFcsv analyses each column from a spreadsheet file and returns a data visualisation that summarises the patterns in each column. For example, the image below depicts data about passengers on the Titanic. The column ‘Sex’ is rendered as a column chart that demonstrates, visually, that there were 314 female and 577 male passengers logged on the Titanic.

Rahul and I talk about the importance of asking good questions of your data before you conduct analysis or tell stories. WTFcsv can help you answer all of the basic questions above, as well as start to form your own questions about the dataset in front of you. For example, for the Titanic data, good questions might be about ethics (‘Why is 'Sex' a binary variable?’), data formatting (‘What does the 'Parch' column mean?’), data quality (‘Is this data complete?’), or data analysis (‘Did women survive at a higher rate than men?’).

It's important to write down all of these questions, because as you go through the next couple of steps, you can try to answer them.

2. Explore all available metadata

Metadata is data about data and can be your golden ticket to establishing context for a dataset. In an ideal world, any dataset that you download would have a detailed and up-to-date data dictionary. This is a document that provides a column-by-column description of the dataset, along with guidelines for using it.

In the above example, from the NOAA National Database of Deep-Sea Corals and Sponges, each field (column) in the dataset is annotated and explained, along with descriptions of data quality, units of measurement, completeness, and usage guidelines.

Seeking metadata is not always easy or successful. Not all data providers produce data dictionaries. In 2016, journalist J. Albert Bowden sought documentation on the fields in a dataset from the US Department of Agriculture. He was told that explanation of their column headers was a proprietary secret. Moreover, even when there are metadata, providers might fail to call the file ‘data dictionary’. For example, if you use the City of Boston's 311 data, the data dictionary is called ‘CRM Value Codex’ – not the most attractive and user-friendly name ever.

And sometimes data dictionaries or other forms of metadata might be outdated because the institution fails to update them when the dataset changes. It's important to have your sceptical, fact-checking, data-verifying journalist hat on at all times.

3. Background the dataset

Journalists often ‘background’ a person or ‘background’ an issue, and likewise the final step of the Three-Step Context Detective is to background your dataset. This may be the most time-consuming aspect of establishing context for your data, but it is well worth the investment in terms of understanding limitations, preventing errors, and discovering newsworthy stories and analysis. In this process, there are at least three separate things to conduct background research on.

Background the collection process

Newcomers (as well as old-timers!) can forget that data collection is often a human, material process. Data science consultant Heather Krause advocates for creating data biographies that describe where the data came from, who collected them, and how they collected them. In the case of the Boston stop and frisk programme discussed above, police officers fill out paper forms after having an encounter with a resident on the street. Then, those forms get turned in at the precinct and a staff member logs the values in a database. Before publishing to the website, other staff members remove personally identifying information. It's all very mundane, but absolutely essential to understanding where errors and missing data can be introduced. The meat is in the bureaucratic details: whether data is self-reported, observed, or measured by a machine; what database the organisation uses to store the data; whether the way the organisation counts and measures has changed recently (making current data and prior data not comparable).

After exploring any available metadata, your best path to backgrounding the collection process is finding a human being to talk to. This might be someone from the collecting organisation, but it's important to think creatively about interviewees when this is not feasible. For example, in the case of stop and frisk data, it's hard to get police spokespeople on the phone. Other potential interviewees might include the youth of colour who are disproportionately stopped, the ACLU who sued the police department and did their own background research on the collection process, or criminal justice scholars who have studied the Boston programme. Krause has a helpful data biography template that you can fill out for this process.

Background the organisation

While we have talked a lot about understanding datasets, Yanni Loukissas makes the case in his book, All Data Are Local, that to use data effectively we also need to understand data settings. Backgrounding the dataset is not just about the data itself – it's also about understanding why an organisation was motivated to collect it in the first place, as well as how they use it.

In the case of the stop and frisk data, this means doing background research on the Boston Police Department: What is their mission? How long have they existed? What is their budget? How many officers are there? When have they been in the news in the last ten years, and why? It also means researching the FIO programme specifically: When and why did the BPD start the programme? Was it part of a national wave of FIO programmes? Is there scholarly and legal debate on whether these programmes are constitutional and effective in reducing crime?

From this understanding of the underpinning organisational and programmatic goals, it's helpful to try to understand how the organisation uses the data it collects internally. For example, does the BPD use its logs of police-civilian encounters to try to limit racial profiling? Do officers have quotas? Who does the BPD have to report their numbers to?

Here, again, interviews with real, live human beings are going to be one of the most effective ways of getting information about the organisation's motivations and uses of the data it collects. But when you can't get an inside interview, one of the best ways to find this information is to:

Background the regulatory environment

Data is expensive to collect, maintain, and organise. Most organisations don't collect data because they want to but rather because they have to, either to comply with laws or with internal policies. Doing background research on the regulatory environment can often shed light on why the organisation collects data, who it reports that data to, and how it reports the data. For example, all accredited higher education institutions in the US have to collect and report data about sexual assault on college campuses because of the Jeanne Clery Disclosure of Campus Security Policy and the Campus Crime Statistics Act (Clery Act).

It is typically easier to background federal and state laws, which tend to be available publicly or identified by talking with lawyers and others with legal knowledge. Internal policy documents that guide data collection can be harder, albeit not impossible, to access. If you live in a country with public records laws, using those laws to request organisational governance documents and training manuals can be an excellent way to understand the internal regulatory context that guides information collection. As an example, most police departments in the US collect data on the use of force by police officers against civilians. Knowing this, when a white male police officer used excessive force at a pool party in 2015, reporters at MuckRock made a public records request for the McKinney police use of force policy. On page nine, the policy details when and why officers are required to file a 'Response to Resistance' report (RTR) and who is responsible for maintaining those reports. This policy would be essential background information for any journalist seeking to write a data story about use of force from RTR data.

Pitfalls

So, the Three-Step Context Detective consists of getting oriented, exploring all metadata and backgrounding the dataset (including the collection, the organisation, and the regulatory environment). In the process of building out these connections between the dataset and its broader context, there are two pitfalls to keep your eyes on.

There are many ‘Unknowns’ in the 2016 dog licensing data from the City of New York. We need to be careful not to make assumptions like ‘Unknown’ means ‘Mixed breed’.

First, beware of your own brain and its penchant for making assumptions to fill in unknowns. It is tempting when looking at columns in a dataset to imagine that you know what they mean, but this can be dangerous. In my data journalism class, we were working with data about the dogs of New York City -- their breeds, ages, and sex.

Of course, some fields had incomplete data and 'breed' was one of those with many 'UNKNOWN' values in the column. One student assumed that breed = UNKNOWN meant that the dog was mixed breed and built their whole story around that incorrect assumption ('UNKNOWN' means the information wasn't filled out by the applicant so we literally do not know the breed). Luckily, the student did end up checking their assumptions and revising the story, and the data itself was fairly low stakes. That said, this illustrates the importance of Jonathan Stray's advice about 'considering multiple explanations for the same data, rather than just accepting the first explanation that makes sense'. The same advice applies when assembling the context around your data just as much as it applies when analysing it.

Secondly, it's important to remember that power is not equally distributed across the collection process, the organisation, and the regulatory environment. The result of social inequalities in the data setting is that the numbers may appear to tell one story on first exploration, but that story might be completely false because the collection environment has systematically silenced people with less power. What does this mean?

In The Numbers Don't Speak for Themselves, Lauren Klein and I discuss a story written by three of my students about sexual assault data provided under the Clery Act. What the students found is that campuses with high numbers of sexual assault were not hotbeds of rape culture; instead, these campuses were providing the most resources and the most supportive environment for survivors to come forward. So, paradoxically, the campuses with the lowest reported rates of sexual assault were not necessarily doing well, but rather creating an environment that actively discouraged survivors from reporting. Meanwhile, the campuses with higher numbers were actually measuring sexual assault closer to the reality of its incidence. This kind of pattern will be visible anytime structural forces like patriarchy, racism, or classism are at work (read: all the time), leading to the systematic undercounting or overcounting of women and other marginalised groups. The way to address it is by establishing context -- the students discovered this through background research, reviewing policy documents and many interviews -- rather than accepting the numbers at face value.

Opportunities

Just as there are pitfalls for context, there are also opportunities for journalists and news organisations to create useful resources for readers and other journalists from their work on context. And context is work! Instead of writing a single story from a data exploration, organisations like ProPublica have started to create what Scott Klein calls 'news apps', that is, evergreen resources like Dollars for Docs. While it is useful for individual readers, Dollars for Docs has also become a data resource for other news organisations to write their own, localised stories on the influence of pharmaceutical companies – for example, this story about the effect of pharma money on doctors in St. Louis, Missouri. In this sense, ProPublica has become known as an 'information intermediary', by turning their original investigation's context and data into a resource that is reusable for other organisations.

ProPublica turns the context work that they do compiling and backgrounding datasets into a source of revenue in their data store.

Verified data and expert contextual information can also become a source of revenue for news organisations. ProPublica maintains a data store where you can purchase datasets on a variety of topics. Many of the datasets available come with excellent ‘data user guides’ – a term coined by Bob Gradeck, manager of the Western Pennsylvania Regional Data Center. In his work promoting open data, he saw the need for metadata that goes beyond the data dictionary to provide a narrative account of a dataset - where it comes from, how it is used by the organisation, and what its limitations are. Examples of Bob’s work can be seen in the data user guides for 311 data in Pittsburgh.

The Associated Press (AP) is also getting into the data + context = revenue game. They spent extensive time compiling a national database on school segregation in the US, which comes with a 20-page data user guide including where the data is collected from and what kinds of questions it can be used to answer. It's available for purchase, and the AP is starting to develop a subscription model where organisations can pay for access to other datasets, context, and discussions with reporters who worked on those issues.

Conclusion

The bottom line is that putting data back into context is laborious but absolutely necessary work for telling data stories. Context is the way to get the story right, even in the face of meager metadata, bureaucratic obstacles, and power imbalances in the data setting. Do you have stories about your work with context and data? Share them in our forums.

]]>
Data journalism on the blockchain https://datajournalism.com/read/longreads/data-journalism-on-the-blockchain Thu, 04 Apr 2019 16:29:00 +0200 Walid Al-Saqaf https://datajournalism.com/read/longreads/data-journalism-on-the-blockchain When we talk about blockchain and journalism, the focus is often on trust and sustainability. Micro-cryptocurrency payments, powered by blockchain, are being explored as a potential solution to the journalism industry's declining revenue streams. At the same time, the technology's ability to protect and verify information has been touted as an antidote to censorship and fake news -- but is this really all blockchain offers journalists?

Although it hasn't received as much attention, the transaction data that blockchain applications leave behind provides a potential goldmine for investigative journalists. Uses of blockchain range far and wide, including by premium wine makers, the tuna industry, and as a means to combat counterfeit pharmaceuticals, opening up a whole new world of data for journalists to explore.

In academia, blockchain has already been used to identify the people behind bitcoin transactions, research illegal activities, and more. Surely, if researchers can use blockchain as an investigative tool, data journalists can too.

Walid Al-Saqaf agrees. A journalist turned academic, Walid brings these two worlds together through his work as a Senior Lecturer in Media Technology and Journalism at Södertörn University in Stockholm, Sweden. He is also a Co-Founder and Vice President of the Internet Society Blockchain Special Interest Group. We spoke to him to find out more about how data journalists can use blockchain in their reporting.

In your own words, can you provide us with an overview of what blockchain technologies do, as well as some prominent applications of them?

As the underlying technology used by the bitcoin cryptocurrency, blockchain technology is a decentralised and distributed database system. I personally think that knowing the internal under-the-hood mechanism of how a blockchain works is not necessary to understand its applications and possible benefits. It may, in fact, be sufficient to say that blockchain is a new type of database that has three characteristics:

  1. It is distributed, meaning that a copy of the data contained on the blockchain is cloned on thousands of nodes.
  2. It is transparent, allowing you to see all transactions that have occurred since the blockchain was created and making any transaction traceable over time.
  3. It is immutable, meaning that it is not possible to change what is written on the blockchain. This is because every transaction is chained together with strong cryptography (hence the name 'blockchain'), creating a permanent archive.

Blockchains, such as ethereum, allow the creation and execution of advanced smart contracts that automate processes based on predefined rules, effectively ending the need for intermediaries to handle the execution manually. For example, you can use a smart contract to have an automatically executed crowdfunding campaign. In this instance, the blockchain would continuously monitor for incoming donations, automatically forwarding the total to the beneficiary once the goal is met. If the goal is not met, it would allow donors to reclaim their donations by calling the smart contract after the deadline.

But this is just one example. The applications of blockchain are many and range from use for exchange of digital assets to crowdfunding, storing and tracking real estate ownership information, electronic voting, securing healthcare records, preserving intellectual property rights, and much more.

Aside from funding, what are the various use cases for blockchain in data journalism?

Blockchain technology can be used to combat fake news, preserve the intellectual property rights of content providers, investigate and track transactions, limit bias and external influence, fight censorship, protect whistleblowers, and enhance citizen journalism.

To use blockchain technology effectively, it is necessary to have the technology diffused and adopted widely, which may take years. Nonetheless, I believe it is just a matter of time until we get to the point of having blockchain accepted in the mainstream.

On this last point, why are journalistic uses of blockchain contingent on it becoming more widespread?

In order to understand and deal with blockchains, journalists have to invest time and energy in getting the training, education and expertise necessary to use and analyse blockchain data. However, since blockchains remain at a very early stage of development, they are not yet scalable and certainly not yet ready for mass adoption. In order to reach critical mass, the technology needs to go through some structural changes and enhancements. Until that happens, journalists can still benefit from exploring some blockchains (like bitcoin and ethereum), but they may have to wait until blockchains become more widespread to use them effectively for day-to-day activities.

What kind of stories would most benefit from leveraging blockchain technologies?

Data contained in blockchains is well-structured and relatively easy to access through a number of APIs and open-source tools. This means that all kinds of stories could benefit from using blockchains as a data resource. Once journalists have extracted this data, they can use it to undertake investigative journalism in the traditional sense.

This means that it is possible for data journalists to extract and analyse blockchain data for many different stories that may be relevant to the public. For example, journalists could look at the rise in the number of cryptocurrency purchases in countries suffering from economic turmoil. Since cryptocurrencies are not linked to any particular government or central bank, they can theoretically survive a global financial crisis. So, an increase in purchases may indicate that people are purchasing bitcoin as a store of value or that people's trust in government is diminishing.

Another scenario worth considering is the anticipated wider application of smart contracts to facilitate a plethora of automated services. Such a development could theoretically lead to the replacement of centralised platforms, such as Facebook and Airbnb, with alternative blockchain-based systems that use smart contracts instead.

Data journalists can also write stories describing overall activities of certain bitcoin wallet addresses by aggregating the information obtained and analysing thousands of transactions recorded on the public blockchains -- these are open and easily accessible for free via services, such as blockchain.info.

Can you list some examples of data journalism that have already used blockchain technologies as a reporting tool?

Instead of giving a list of examples, I'd rather give an example of the two main types of data journalism stories that one can do by using blockchain as a reporting tool: micro and macro.

In the micro case, a journalist would zoom in and follow the trail of transactions sent from or received by a particular wallet address. One example of such an approach is the reporting done for Quartz by Keith Collins on the WannaCry ransomware attack.

A screenshot from Keith Collins’ blockchain-powered investigation into ransomware attacks. Live version here.

Collins was able to see whenever any bitcoin address sent an amount to the three WannaCry addresses, which received more than $140,000 in bitcoin. He also identified when the hackers started cashing out by forwarding the received funds to other accounts. The fact that public blockchains are trackable makes it possible to trace any payment back to its original source address.

This is a feature that makes investigative journalism quite exciting, and also has the advantage of making it possible to prove the findings reached in the investigative story.

The other type of reporting is macro, which tries to provide readers with an overview of aggregated information, instead of doing a deep analysis focused on only a few addresses. An example of this type is a report by Camila Russo for Bloomberg, which found that criminal activity on bitcoin constituted roughly 10 percent in 2018, compared to 90 percent in 2013. The report noted that this drop may have been due to the realisation that criminal bitcoin transactions can be tracked (as was the case when US law enforcement units succeeded in tracking and taking down the dark web marketplace called 'Silk Road').

Turning to the micro use case, what else can transaction data be used to reveal? Is it limited to cryptocurrencies and 'follow the money' investigations, or can journalists use it as a source for a wider array of stories?

Depending on the blockchain and the type of data stored in the transaction, it can be both financial and non-financial information. Initially, it is logical to use bitcoin for tracking financial transactions and relationships between various nodes in the network since that is the most dominant use case for that blockchain.

There is also potential to explore smart contracts created on Ethereum to identify the level of activity and progress achieved for a particular crowdfunding campaign, for example. Looking at the utility of an Ethereum ERC20 token more broadly, it's possible to know how many smart contracts were created and how they were used over time simply by accessing the blockchain data. This can provide journalists with very useful insights into usage patterns, regardless of the amounts being sent or received.

Furthermore, the bitcoin blockchain allows you to embed a small piece of text (maximum 40 bytes) using the OP_RETURN operator. This operator is used to add text related to a particular transaction that could point to a URL, for example, or provide some clues or meaningful information as a permanent note connected to the transaction. In some research that I am doing to understand the breadth and reach of the WannaCry malware attack, for example, I am investigating all the transactions that involved sending funds to the three WannaCry addresses. To this end, I looked for any leads in the OP_RETURN values to identify whether there were any unusual messages. Sure enough, I discovered that one of the transactions had the OP_RETURN string "Caution! WannaCry Address!!", which indicated that this particular sender, who sent a very small amount, wanted to warn other potential victims not to send money to the wallet address and to keep a permanent timestamped record on the blockchain with this information. You can find that transaction directly on the blockchain here.
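
To give a feel for what inspecting these values involves, here is a minimal Python sketch that decodes the message from an OP_RETURN output script, assuming you already have the script's raw hex from a block explorer. The hex string below is invented for illustration, and the sketch handles only the simple case of a single short data push.

    # An OP_RETURN script: 0x6a (OP_RETURN), one length byte, then the payload
    script_hex = "6a0b48656c6c6f20776f726c64"  # hypothetical example script

    raw = bytes.fromhex(script_hex)
    assert raw[0] == 0x6a, "not an OP_RETURN output"

    length = raw[1]
    message = raw[2:2 + length].decode("ascii", errors="replace")
    print(message)  # "Hello world"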

In short, every blockchain is different and has its own ways of embedding text or calling smart contract functions, so journalists need to know exactly what they are after before starting the data extraction and analysis.

Turning to the fake news use case, how can blockchain be used as a data verification tool? Do you have any examples where it has been used by data journalists in this way?

Yes, I have completed an extensive study of Civil, one of the most popular blockchain-based journalism projects. Civil aims to prevent disinformation by using cryptoeconomics to incentivise users to take action when they discover fake news or other malpractice by a Civil newsroom. The dynamics of how this works are described in the Civil Constitution. In my research, I took a critical look at the project to see whether a Civil newsroom provides a relative advantage over traditional newsrooms in this respect. My conclusion is that it does, on the condition that Civil users act rationally and predictably, and that abuse of the platform cannot corrupt the whole system.

One of the other promising uses of blockchain against fake news and disinformation is its ability to record original content with an immutable proof of creation. Aside from the usual doctoring of images, one area causing headaches nowadays is the ease of manipulating videos with 'deepfake' technology. Blockchain's ability to timestamp original content immutably makes it a viable way to help address this problem, as demonstrated by Truepic, an app that uses blockchain to notarise images and videos as they are taken. It does this by storing metadata that certifies the authenticity of an original image or video captured on a particular mobile device. I think such technologies can give readers more confidence that what they are viewing is not fake. If someone manipulates the original, the copy can easily be flagged: it will either have no entry on the blockchain, or its entry will carry a later timestamp, since the original entry cannot be changed.
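A minimal sketch of the underlying idea, which is not Truepic's actual implementation: notarisation amounts to hashing a media file at capture time and anchoring that hash on a blockchain, while verification recomputes the hash and compares it with the on-chain record.

```python
import hashlib

def fingerprint(path):
    """Compute a SHA-256 fingerprint of a media file.

    A notarisation service stores such a hash (plus capture metadata) on a
    blockchain at the moment the photo or video is taken; any later edit to
    the file changes the hash."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Verification: recompute the hash and compare it with the on-chain record.
# A mismatch, or the absence of any record, indicates the file is not the
# original -- and a later timestamp exposes a re-uploaded manipulation.
```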

What level of technical skills do journalists need to start using blockchains in their reporting?

I believe it all depends on the level of sophistication journalists need to reach their objective. If the objective is just to identify trends in the cryptocurrency and initial coin offering (ICO) space, for example, they can use simple web-based API queries to extract the data they need, which can then be analysed in MS Excel or any other spreadsheet software.
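For example, a query along the following lines pulls a year of daily bitcoin prices into a CSV file ready for spreadsheet analysis. The endpoint is blockchain.info's public charts API, which is my choice for illustration rather than one named in the interview.

```python
import csv
import requests

# Pull a year of daily bitcoin market prices from a public charts API.
data = requests.get(
    "https://api.blockchain.info/charts/market-price",
    params={"timespan": "1year", "format": "json"},
).json()

# Save as CSV so the analysis can continue in Excel or any spreadsheet tool.
with open("btc_prices.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "usd_price"])
    for point in data["values"]:
        writer.writerow([point["x"], point["y"]])
```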

Data journalists may also consider enhancing their programming skills in order to tap into blockchains that can give access to their content via APIs. One example of a blockchain that may be of interest in this regard is Steemit, a blockchain-based platform that allows users to monetise their own content through direct cryptocurrency donations.

As illustrated in this Steemit white paper, authors get paid when readers upvote their posts.

A study by Mike Thelwall investigated whether Steemit works as an effective social news platform by rewarding users for social content and curation. It analysed 925,092 posts to understand how much they earned and what drives Steemit members to reward certain posts.
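A sketch of the kind of query such research relies on, using Steemit's public JSON-RPC endpoint (the endpoint, method, and permlink here are my assumptions for illustration): fetch a single post and read off its payout and vote count.

```python
import requests

# Query one post from the Steem blockchain via the public JSON-RPC API.
resp = requests.post("https://api.steemit.com", json={
    "jsonrpc": "2.0",
    "method": "condenser_api.get_content",
    "params": ["steemitblog", "example-permlink"],  # hypothetical post
    "id": 1,
})
post = resp.json()["result"]

# Earnings and engagement fields of interest for a Thelwall-style analysis.
print(post.get("total_payout_value"), post.get("net_votes"))
```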

Data journalists with the right programming skills might consider undertaking similar research. But if the intention is to analyse millions of connections between bitcoin nodes and how they formed over time, a higher level of coding skill (perhaps in Python or PHP) is needed to communicate with the blockchain, extract the data, and store it efficiently in a database. To analyse nodes, journalists also need a deeper understanding of network analysis, and may need Gephi or other network visualisation and analysis tools.
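As a sketch of that handoff (with made-up addresses, and the networkx library as my choice of tool), the bridge to Gephi can be as simple as building a weighted, directed payment graph and exporting it to GEXF:

```python
import networkx as nx

# Made-up (sender, receiver, amount) tuples standing in for transaction
# data previously extracted from the blockchain.
transactions = [
    ("addr_A", "addr_B", 0.5),
    ("addr_B", "addr_C", 0.2),
    ("addr_A", "addr_C", 1.0),
]

# Build a directed graph whose edge weights accumulate total payments.
G = nx.DiGraph()
for sender, receiver, amount in transactions:
    if G.has_edge(sender, receiver):
        G[sender][receiver]["weight"] += amount
    else:
        G.add_edge(sender, receiver, weight=amount)

nx.write_gexf(G, "payments.gexf")  # open this file in Gephi for exploration
```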

How can data journalists start using blockchains as a data source?

Since I believe it is useful for data journalists to use public blockchains as a data source, it is high time for journalists to acquire the basic skills to access, extract and analyse blockchain data. That's why I have personally started working on an open source library, published as a GitHub repo under the name DDJBlocks (still in development), that could shorten the time and learning curve for journalists starting to look into bitcoin blockchain data for potential stories.

For demonstration purposes, I have put together a Google map showing where donations to Wikileaks have come from over the years. I used DDJBlocks to extract the transactions that included payments to one of the earliest known Wikileaks wallet addresses on the bitcoin blockchain. For the transactions that carried relay IP addresses, I then geolocated each IP to a city, summed the amounts coming from each city in a spreadsheet, and fed that spreadsheet directly into the Google Maps web interface to overlay the data on the world map.
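The aggregation step might be sketched as follows. The transactions here are made up, and the extraction and geolocation are assumed to have been done beforehand, for instance with a tool like DDJBlocks plus an IP geolocation service.

```python
from collections import defaultdict
import csv

# Made-up, already-geolocated transactions standing in for extracted data.
transactions = [
    {"city": "Berlin", "amount_btc": 0.8},
    {"city": "Stockholm", "amount_btc": 0.3},
    {"city": "Berlin", "amount_btc": 1.2},
]

# Sum donated amounts per city.
totals = defaultdict(float)
for tx in transactions:
    totals[tx["city"]] += tx["amount_btc"]

# Write a CSV ready to be overlaid on a map.
with open("donations_by_city.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["city", "total_btc"])
    for city, total in sorted(totals.items()):
        writer.writerow([city, total])
```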

Walid’s map of Wikileaks donations, powered by blockchain data. Darker blues represent larger donations.

I have also used the tool to do a network visualisation of the address that was used in 2010 to pay 10,000 bitcoins (worth around $41 at the time) for two pizzas from Papa John's. Laszlo Hanyecz's 10,000 bitcoins have spread out over the last eight years, ballooning to a worth of over $65 million today. Marked by bloomberg.com as a milestone in the decade-long history of bitcoin, that pizza purchase can arguably be considered the first proof that a cryptocurrency with no central authority behind it could serve as a means of peer-to-peer payment. It will be quite valuable for data journalists to detect and cover other milestones for bitcoin and other cryptocurrencies in the years to come.

Any final thoughts?

I predict that this form of data journalism will become increasingly relevant as bitcoin and other public blockchains gain mainstream acceptance and as blockchain adoption reaches new heights. The main challenge is allocating sufficient time and resources, particularly at journalism schools, so that students and researchers look into this field. It's important that J-schools get ahead of the curve to equip the next generation of data journalists around the world. During my lectures at Södertörn University, I regularly bring up blockchain as a technology students need to be aware of alongside traditional centralised database systems.

The essential lies in news maps https://datajournalism.com/read/longreads/the-essential-lies-in-news-maps Thu, 04 Apr 2019 10:29:00 +0200 Maarten Lambrechts https://datajournalism.com/read/longreads/the-essential-lies-in-news-maps "Not only is it easy to lie with maps, it is essential."

While this may seem a bold and surprising statement, it is the long-held view of renowned geography professor Mark Monmonier.

And, of course, Monmonier is right. In order to display the big, three-dimensional, and complex world we live in on a small piece of two-dimensional paper or on a screen with a limited number of pixels, we are forced to distort reality. As you'll soon see, every map does so in its own way. So how can maps, which distort reality, be married with journalism, which tries to paint an objective and accurate image of the world?

In the context of news, maps are an intuitive way to show the location of where events took place, but they can be so much more than this. Maps can also explain how things happened, they can be the canvas on which a story is told, they can put the size and extent of things in context, and they can be used to show geographical patterns hidden in data.

So, are all maps in the news lying? Are all news maps 'fake news'? If done well, they are not. But it is quite easy to produce misleading maps, even with the best intentions. And because there are still plenty of good reasons to use maps in the newsroom, let's look at some commonly used map types and learn how to avoid being misled by them.

Maarten Lambrechts is a data journalist, data designer, and visualization consultant. Follow on Twitter: @maartenzam.

The locator map

A good news story answers the '5 W' questions: the who, what, when, where and why of something that happened. When an article only mentions the where of a story in the text, many people will not be able to really connect to it. A lot of readers simply lack the geographical knowledge to pinpoint Lombok, Lithuania, Luanda or Leicester Square, and to relate these locations to the places where they are living themselves, or to other places they are familiar with.

The visualisation of a location through a locator map can overcome this problem. This type of map helps the reader to contextualise a news story geographically; it shows the location of an event in the context of the surrounding geography, offering many entry points to the map for the reader to connect with. Locator maps in the news show where that earthquake happened, where that exotic tax haven is located, where in my city that bank was robbed and where exactly in the world that ongoing violent conflict is taking place.

Devastation in Lombok

A locator map showing the location of Lombok, part of the introduction of Devastation in Lombok, by Reuters Graphics.

Locator maps let people assess how a news story is related to their own life. Did something happen close by? Did something happen in a country they've visited or where people they know live? Or did something happen in a country neighbouring a country they know or have some kind of connection to? Based on these questions, readers can quickly evaluate how relevant a story is to them. And a locator map makes this evaluation easier than providing a description in text only.

So, in what sense does the humble locator map lie to the reader? Well, locator maps are usually very small in order to remain readable. This means they leave out many details: sinuous roads become straight lines, smaller roads are omitted (or roads are left out altogether), and a group of mountain peaks can be represented by a single symbol for a whole ridge. In other words, these maps are heavily generalised, which limits their accuracy and their broader use. Don't use them for navigation, for example. You will get lost.

The breaking news map

Maps can do more than show the ‘where’ of a story. When news breaks, the big challenge for journalists is to explain to their audiences how events unfolded. In many cases, the best way to do so is by using an annotated ‘breaking news’ map.

A breaking news map showing what happened in Nice on 14 July 2016, from What We Know After Terror Attack in Nice, France by the Wall Street Journal. Notice how the annotated map at the top is accompanied by two locator maps: the first situates Nice in France and the second situates the attack within the city of Nice.

While locator maps only communicate 'something happened here', annotated maps can show a sequence of events and convey other information relevant to the story. People familiar with the location can mentally replay what happened by connecting what’s on the map with how they know the place.

Often these maps use oblique, 3D-views of a city, so people unfamiliar with the location can still get a good sense of exactly what happened and how things looked on the ground. With Google Earth Pro you can generate these very detailed, oblique views for free.

An annotated map describing the events in Berlin on 19 December 2016 by Spiegel Online.

But be careful. Some people might think these 3D images are real pictures, taken at oblique angles from airplanes or helicopters. Remind them that they are not: they are generated by Google Earth by 'draping' satellite images over a detailed 3D model of the Earth. In some cases this process leads to glitches, as the Ferris wheel in Scheveningen, the Netherlands, below clearly demonstrates.

The Scheveningen Ferris wheel, a glitch in Google's 3D model of the Earth.

It’s also important that readers and visual journalists remember that Google Earth images are typically a few months to a few years old. Suggesting that these images are 'live' or taken after the breaking news event took place would be lying.

The extent map

Let’s move along to a map type that is definitely lying to the reader: the extent map. In order to communicate the size and magnitude of things around the world, these maps cut them out of their real geographical location and paste them into a foreign one.

For example, during the 2014 Winter Olympics in Sochi, the New York Times produced an extent map to illustrate the size of the Olympic infrastructure by copy-pasting the venues into the streets of Manhattan.

Luge, Bobsled and Skeleton

Racers might begin their starting sprints 40 stories up and several blocks north of Times Square for the run down the city’s own version of the Sanki Sliding Center’s track, finishing in a big turn on the plaza in front of the Armed Services Recruiting Center. Credit: Is That a Luge in Times Square? by the New York Times.

Many readers lack good reference points for assessing how long a 400-metre ship really is, or how far 5,800 square kilometres really stretch. By showing these objects and areas on maps the reader can relate to, a direct connection is made to their own frames of reference.

When the iceberg A-68 broke away from the Larsen C ice shelf in Antarctica, many media outlets compared it to the size of Delaware. For Americans familiar with the size of this state, the comparison was probably helpful. But for many readers the size of Delaware is just as abstract as 5,800 square kilometres, the actual size of the iceberg. That's why the Berliner Morgenpost built a little interactive extent map, which readers can use to copy-paste the iceberg's silhouette onto any number of familiar places.

Extent maps are only useful when the area used for comparison is familiar to the reader. Otherwise the question of 'How big is that iceberg?' remains unanswered and the reader ends up with more questions, like 'How big is Delaware?'. Some readers may even be confused and think the iceberg (which is in a place far away from them) is located near Delaware (an equally remote place for some).

The before-after map

A map type that is becoming increasingly popular in news stories today is the before-after map. While before-after images have a long history in the news, journalists have previously been limited to photographs taken from the ground. Before and after images are now mostly taken from space, by satellites circling the globe.

Until recently, detailed satellite images with a high temporal resolution were simply not available: details were too blurry, or the time intervals between pictures were too wide to be useful. Today, a range of satellites with high resolution cameras fly over the same location once every week, and sometimes even once per day. This allows for the detection and reconstruction of events like deforestation, floods, droughts, and the construction of buildings.

NASA's Images of Change is an example of the powerful impact of before-after maps. Included in the gallery are images highlighting the impact of drought on Europe, forest fires in California, and hurricane damage in Puerto Rico.

Newsrooms have also discovered the power of before-after satellite images. Often, the supplier of these images is Planet, a Silicon Valley satellite imaging company that offers daily high resolution pictures with its 'flock' of shoebox sized satellites. The company has already provided images for news stories about the construction of Chinese coal plants, the effect of drought on German vegetation, and the development of North Korean missiles.

Before-after maps are usually very explicit about the date the images were taken, so there is little room to mislead there. But like all aerial imagery, these images still have limitations. For example, images are only useful for before-after maps when they are taken during the daytime on cloud-free days. Ever noticed that the sun is always shining in these images? For this reason, photos showing destruction after a big storm usually take some time to become available, once the clouds have cleared. Some parts of the world are also much cloudier than others, so clear satellite pictures of these regions are rare.

And that's not all. All of these raw images need to be corrected with good colour correcting algorithms. Differences in the angle at which the sun is illuminating the scene, differences in atmospheric conditions, and the variations between cameras on board different satellites introduce biases and glitches in satellite images. Ignoring these differences, or trying to remove them with badly designed algorithms, will lead to misleading before-after images. Brown areas 'affected by drought' could well be looking a lot greener when processed incorrectly, for example.

Numbers on maps

'There are lies, damn lies, and statistics,' the saying goes. So what happens when you mix statistics with maps, which are distortions of reality by definition? You get numbers on maps, and those are really easy to screw up and can very easily mislead.

These maps don’t serve as general reference maps; instead, they use data to show geographical patterns about a certain topic. Most commonly, journalists use them at election time to show where people voted, and for whom.

A 2016 election map showing voting patterns in Berlin. Interactive version by Berliner Morgenpost.

Maps that show administrative areas shaded according to some data value are called choropleth maps. These are useful for showing geographic patterns in statistics that are collected at the level of administrative units; for example, the average age of the population for each country, or the share of impoverished population living in the municipalities of a country.

One common mistake when making choropleth maps is using absolute numbers instead of relative (or 'normalised') numbers. Values need to be scaled by the population of each administrative unit. If numbers are not scaled, the result is a map like the one below (which I understand is a favourite of President Trump’s):

Tweeted on 11 May 2017: "Spotted: A map to be hung somewhere in the West Wing".

On this map, population density is not taken into account and, as a result, Republican red dominates (that’s why President Trump likes this map so much). The millions of Democratic voters concentrated in the big cities on the east and west coasts are barely visible, because they live in a relatively small geographical area.
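To make the normalisation point concrete, here is a minimal sketch with made-up numbers: scale raw counts by population so the choropleth encodes a rate rather than an absolute total.

```python
import pandas as pd

# Made-up counts per administrative unit.
df = pd.DataFrame({
    "county": ["Rural A", "Rural B", "Urban C"],
    "voters": [9_000, 12_000, 400_000],
    "population": [15_000, 20_000, 900_000],
})

# Normalise: voters per 1,000 inhabitants is the value a choropleth
# should encode, not the raw totals.
df["voters_per_1000"] = df["voters"] / df["population"] * 1000
print(df)
```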

Many techniques exist to overcome this problem; one of them is the cartogram.

A US election map that scales the states to account for the number of voters living in them. Credit: The New York Times.

With this technique, densely populated areas are assigned more visual space on the map. But cartograms have their own downsides: they distort the geography considerably, as is clear in the example above.

Lying world maps

Probably the biggest and most frequent lie in mapping is found in world maps that use the Mercator projection. Almost all online maps use Mercator, and the projection appears in many static and offline world maps too.

The problem with the Mercator projection is that it enormously distorts areas close to the poles. Did you know that Greenland isn’t really the same size as Africa? It’s actually 14 times smaller.

Because of these distortions, it is better to avoid the Mercator projection for maps showing areas close to the poles and world maps. This is the reason why Google Maps decided to switch to an orthographic projection when zooming out to a world view.

As a rule of thumb, journalists should try to avoid using the Mercator projection when making world maps. Good alternatives are the Robinson and Winkel-Tripel projections, or the recently developed Equal Earth projection, which respects areas throughout the whole map.
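For those who build maps in code, switching projections is usually a one-line choice. A minimal sketch using the cartopy plotting library, which is my choice of tool rather than one named in the article:

```python
import cartopy.crs as ccrs
import matplotlib.pyplot as plt

# Draw a world map in the Robinson projection instead of Mercator; swap in
# ccrs.EqualEarth() to try the newer equal-area projection.
ax = plt.axes(projection=ccrs.Robinson())
ax.set_global()
ax.coastlines()
plt.savefig("world_robinson.png", dpi=150)
```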

WebGL-based maps

Today's modern web browsers support WebGL, a technology that allows browsers to tap into a computer's graphics card, opening up a whole new range of mapping possibilities.

WebGL makes it possible to rotate maps, tilt the camera, and visualise data in three dimensions -- it's an exciting new toy. But all these fancy features make it harder for readers to assess the numbers behind a visualisation. Tilted views make features in the back look smaller, and they can even be hidden or obscured by features in front of them. And because the camera can be rotated and flown around, north is not always up on these maps, which may confuse readers.

Let's look at an application of this technology in One belt, one road by The Financial Times. This story uses an animated map of the new Chinese Silk Road that reacts as the user scrolls through the text. Different features can be highlighted by zooming and rotating the map to get the best view of each section of the route. Although this helps step the reader through the story, it also means that north is not always up, which can be confusing for readers unfamiliar with the countries and cities shown.

The ‘One belt, one road’ project by the Financial Times.

The adoption of WebGL by mapping tools like Mapbox and more recently kepler.gl has opened the door for WebGL driven maps that show numbers. These have already found their way into the media, as An Extremely Detailed Map of the 2016 Election by The New York Times shows.

An extremely detailed map of the 2016 election by The New York Times.

Every map is a lie

Mapmakers make a lot of design decisions in order to produce clear and useful maps. They leave things out, simplify things, highlight elements and put other elements in the background. Areas, shapes and lines are distorted and geographical features may be shifted out of place. Sometimes old imagery is used to show where recent events took place, and unlike in the real world the sun is always shining in satellite images.

The degrees of freedom in the design of a map are infinite and by changing the size of features, the colours, the layering, the composition and the projection of a map, a different story is told and other distortions are introduced.

This illustrates perfectly another point made by Professor Monmonier: "A single map is but one of an indefinitely large number of maps that might be produced for the same situation or from the same data."

Maps are powerful, but they can paint a misleading picture. Mapmakers, as well as map readers, should be aware of their limitations. So remember: all maps are a lie. But these are necessary lies.

Now that you've learnt how maps can lie, expand your mapping skills by:

  • taking Maarten's video course Mapping for Journalism to create both static and interactive maps for your stories
  • learning a data visualisation trick, which allows you to represent tiny and bigger countries on the same chart
  • exploring the community's favourite maps in this edition of Conversations with Data.