The best ‘data stories’ are not obvious. They don’t hit the reader over the head with numbers, at least not initially. But the data is the very foundation on which the story is built, and it can help guide reporters to the best anecdotes or ways to illustrate their findings.
Good data editing requires an understanding of all that, along with critical thinking, project management skills, and a better-than-average understanding of the content, context, and organisation of the data.
Just as all investigations need a process that works for everyone, so do data investigations. Knowing the process can help editors ask the right questions and backstop reporters. It’s also important (and helpful to the overall story) for all team members to understand the methodology, regardless of whether they’ll be working with the data.
Good data projects have good workflows, which depend in part on backout scheduling: know your publication deadline, then work backward to set a date for each milestone that must be hit before it. Data-based investigations have more moving parts and more prerequisites than most, so it’s vital that everyone knows the order in which the work must get done.
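As a rough illustration of backout scheduling, the sketch below works backward from a publication date to give each step a start and due date. The step names, durations, and publication date are hypothetical, and it counts calendar days rather than working days; treat it as a starting point, not a planning tool.

```python
from datetime import date, timedelta

def backout_schedule(publish_on, steps):
    """Return (step, start, due) tuples, working backward from publish_on.

    steps: list of (name, days_needed), in the order the work must happen.
    """
    schedule = []
    due = publish_on
    for name, days_needed in reversed(steps):
        start = due - timedelta(days=days_needed)
        schedule.append((name, start, due))
        due = start  # the previous step must be finished when this one starts
    schedule.reverse()
    return schedule

# Hypothetical steps and durations, for illustration only.
steps = [
    ("Acquire and check data", 21),
    ("Analysis and independent re-check", 14),
    ("Write methodology", 5),
    ("Draft story", 10),
    ("Share findings with targets; legal review", 7),
]

for name, start, due in backout_schedule(date(2025, 6, 30), steps):
    print(f"{start.isoformat()} to {due.isoformat()}: {name}")
```

Listing the steps in working order makes the dependencies explicit: if one step slips, every date before publication has to be renegotiated.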
Getting started: guidance for journalists and their editors
Journalists, start as you would any other investigation: with initial research and reporting to identify what data and documents exist. If you get stuck, look for examples of similar reporting (these can be found in the IRE Resource Center) or search scientific and academic works to identify experts who have done similar work or share the same interest in your topic. Often such sources can help you streamline your preliminary data research.
It’s important to remember that initial story memos or pitches should include information about what data is available and how that data might help you tell the story.
Once the data is identified and acquired or assembled (more on that later), it and any notes should be somewhere the entire team can access during the reporting and editing process. The key is to keep everyone in the loop so there are no surprises. When working on the data analysis, avoid using email to track changes or share updates; messages get lost and versions become confusing. If you have a project management system in your newsroom, you may want to use that, but tools such as GitHub or Google Docs also work well.
If you’re an editor who doesn’t use programming, you should still make sure scripts contain comments explaining what the code does, so that another person could follow along and understand your reporter’s data and methodology.
You also need an internal process for independently double-checking the data analysis. For example, if you have a large enough team, one reporter could be the backstop for another. You also might ask a trusted colleague outside the organisation to be your backstop. Keeping a data diary is essential to this end, because whoever does the checking needs a record of what was done.
Data diaries also come in handy for keeping track of file names, code, and syntax, and footnoting and lawyering copy. Mostly, they need to be clear enough that colleagues, including the editor, can make sense of the work.
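A data diary can be as simple as a shared text file. The sketch below is one possible way to keep one from inside a script: a hypothetical `log_entry` helper appends a timestamped line recording the step, the file, the row count, and a note. The file names and numbers are invented for illustration.

```python
import tempfile
from datetime import datetime
from pathlib import Path

def log_entry(diary_path, step, source_file, rows, note=""):
    """Append one timestamped line to a plain-text data diary."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    line = f"{stamp} | {step} | file={source_file} | rows={rows} | {note}\n"
    with Path(diary_path).open("a", encoding="utf-8") as diary:
        diary.write(line)
    return line

# Example run in a temporary folder; in practice the diary would live
# wherever the whole team can read it.
with tempfile.TemporaryDirectory() as tmp:
    diary = Path(tmp) / "data_diary.txt"
    log_entry(diary, "import", "citations_raw.csv", 48210,
              "loaded agency export; no rows dropped")
    log_entry(diary, "filter", "citations_work.csv", 47902,
              "removed 308 voided citations, per agency clerk")
    print(diary.read_text(encoding="utf-8"))
```

Recording the row count at every step makes it easy for a colleague to spot where records were gained or lost.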
Bulletproofing the data and its analysis
Bulletproofing a data-driven investigation begins with bulletproofing the data before starting the analysis. This is important because often the simplest problems get overlooked, such as not having all of the relevant records. Editors who know this can backstop their reporters’ practices and ensure that their stories are set up for success from the get-go.
In addition, you should always work off a copy, instead of the original data, in case something goes wrong during the analysis, such as your computer dying or an error being accidentally introduced into the data. As you do your checks, always keep notes on what you found -- it will help you later when you do your analysis. Here are the checks that data journalists should conduct on every dataset, which can also be used by editors to backstop analyses:
- Check that you have all of the relevant records. It’s easy for an agency to accidentally miss some records, either by copying and pasting data or reusing an old query that pulls only certain records. If there is no reference for the exact correct number, use common sense. Would it be reasonable that the United States would have only 80,000 voters?
- Make sure all locations, such as cities or counties, are included.
- Look for inconsistencies in key fields. For example, are city names spelled the same way? It’s important because it could affect your results. You can do this check by getting a list of all unique possibilities within a given field and sorting them alphabetically.
- Make sure that numeric and date fields are within valid ranges. For example, does your data include dates of birth that would make individuals too young or too old?
- Check for missing data or blank fields. Make sure that you did not cause these problems by importing data incorrectly. Look at the file in a text editor to be sure.
- Double-check totals or counts against summary reports from the agency.
- Know your data. Know what every field means and how the agency uses it. Something that looks boring to you could be critical to your analysis.
- Talk with the folks who work with the data and ask them about the checks they do.
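Several of the checks above can be scripted. The following is a minimal sketch using only Python’s standard library; the sample data, field names, and date limits are made up, and a real dataset would be read from a working copy of the file rather than from a string.

```python
import csv
import io
from collections import Counter
from datetime import date

# Tiny made-up sample standing in for an agency export.
SAMPLE = """id,city,date_of_birth,fine
1,Springfield,1980-04-12,50
2,springfield,1975-09-30,75
3,Shelbyville,2150-01-01,60
4,,1990-06-15,
5,Springfeild,1962-11-02,40
"""

def load_rows(text):
    return list(csv.DictReader(io.StringIO(text)))

def unique_values(rows, field):
    """(value, count) pairs sorted alphabetically, so variants sit together."""
    counts = Counter(r[field].strip() for r in rows)
    return sorted(counts.items(), key=lambda kv: kv[0].lower())

def blank_counts(rows):
    """Number of blank cells in each field."""
    return {f: sum(1 for r in rows if not r[f].strip()) for f in rows[0]}

def out_of_range_dates(rows, field, earliest, latest):
    """IDs of rows whose date falls outside earliest..latest."""
    bad = []
    for r in rows:
        if r[field].strip():
            d = date.fromisoformat(r[field])
            if not (earliest <= d <= latest):
                bad.append(r["id"])
    return bad

rows = load_rows(SAMPLE)
print("records:", len(rows))  # compare with the agency's own total
print("cities:", unique_values(rows, "city"))
print("blanks:", blank_counts(rows))
print("bad dates:", out_of_range_dates(
    rows, "date_of_birth", date(1900, 1, 1), date(2024, 12, 31)))
```

In this sample, the sorted city list puts "Springfeild", "Springfield", and "springfield" next to each other, which is exactly the kind of inconsistency that would otherwise split one city into three in a count.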
Once you’ve checked your data, you’re ready to do your analysis. Keeping notes about what you do will be crucial at this stage. Those notes will help you write your methodology later and will help you (and your editor) vet the findings. As you go along with your analysis, be sure to regularly back up your data and use a naming convention that makes sense to you and to others who may use the data. Here are a few other tips to keep in mind as you undertake your analysis:
- Make sure you’re using the right tool. You may need to do more than counting and sorting.
- Check with experts from different sides of the issue about your methods and your findings.
- Beware of lurking variables. The trend you found could be caused by an underlying variable you haven’t considered.
- If you think you’re in over your head, call on an expert to help. Don’t guess or assume.
- Double-check surprising results. For example, if citations spiked by 50% in one year, it could be a story or it could (more likely) be an error.
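One way to make sure surprising results get a second look is to flag them automatically. The sketch below computes year-over-year percentage change and lists any year that moves by at least a chosen threshold. The citation counts and the 50% threshold are hypothetical; a flag means "check this by hand", not "this is an error".

```python
def flag_big_swings(counts_by_year, threshold=0.5):
    """Return (year, previous, current, change) where |change| >= threshold.

    change is a fraction of the previous year's count: 0.5 means +50%.
    """
    flagged = []
    years = sorted(counts_by_year)
    for prev_year, year in zip(years, years[1:]):
        prev, cur = counts_by_year[prev_year], counts_by_year[year]
        if prev == 0:
            continue  # change from a zero base is undefined; review by hand
        change = (cur - prev) / prev
        if abs(change) >= threshold:
            flagged.append((year, prev, cur, round(change, 3)))
    return flagged

# Hypothetical citation counts by year.
citations = {2015: 1200, 2016: 1260, 2017: 1230, 2018: 1845, 2019: 1300}

for year, prev, cur, change in flag_big_swings(citations):
    print(f"{year}: {prev} -> {cur} ({change:+.0%}); verify before reporting")
```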
Often data ‘analyses’ are counting or summing data, but if you need to do a more complex analysis, here are some suggestions to help you figure out the best methodology:
- Read research reports. Academic research on your topic might reveal best practices for working with your data.
- Find an expert to vet your methodology. Many are happy to help, especially once they realise you’re interested in doing a serious analysis. When the Dallas Morning News examined jury strikes, one of the leading experts on bias in jury selection reviewed all of the reporting team’s findings.
- Show findings to the targets of the story. We’re not suggesting sharing your story, but you should put together a findings document or presentation that you can share with the targets of the story. This helps bulletproof your methodology by surfacing any problems that may exist (or variables you didn’t consider) before publication.
- Duplicate your work to make sure you didn’t mess something up along the way. Don’t just rerun the original scripts; recreate them so you know they were done correctly the first time.
- Maintain a consistent universe of cases. If you have to filter or redefine your universe, be able to explain why you isolated certain records or cases.
- Give yourself enough time to follow through on collecting information for your database before you start writing. If you’ve built your own database, where information may need to be updated or will change after additional reporting, set a cut-off date and don’t make any more changes to the database unless the data is inaccurate or the new information will change the meaning of the story.
- If you are doing the data entry yourself, make sure at least two people have reviewed every record, or consider hiring a data-entry firm that uses double-entry verification.
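Double-entry verification, mentioned in the last tip, means two people key in the same records independently and every disagreement is resolved against the source document. A minimal sketch of the comparison step follows; the `case_id` key, field names, and records are hypothetical.

```python
def compare_entries(first_pass, second_pass, key="case_id"):
    """Compare two independently keyed versions of the same records.

    Returns (mismatches, only_first, only_second), where mismatches is a
    list of (key_value, field, first_value, second_value).
    """
    a = {r[key]: r for r in first_pass}
    b = {r[key]: r for r in second_pass}
    mismatches = []
    for k in sorted(a.keys() & b.keys()):
        for field in a[k]:
            if a[k][field] != b[k].get(field):
                mismatches.append((k, field, a[k][field], b[k].get(field)))
    only_first = sorted(a.keys() - b.keys())
    only_second = sorted(b.keys() - a.keys())
    return mismatches, only_first, only_second

# Hypothetical records keyed in by two different people.
first = [
    {"case_id": "A1", "name": "J. Smith", "amount": "1500"},
    {"case_id": "A2", "name": "R. Jones", "amount": "320"},
    {"case_id": "A3", "name": "L. Chen", "amount": "75"},
]
second = [
    {"case_id": "A1", "name": "J. Smith", "amount": "1050"},
    {"case_id": "A2", "name": "R. Jones", "amount": "320"},
    {"case_id": "A4", "name": "M. Diaz", "amount": "90"},
]

mismatches, only_first, only_second = compare_entries(first, second)
print("disagreements:", mismatches)
print("missing from second pass:", only_first)
print("missing from first pass:", only_second)
```

The same comparison works for checking a recreated analysis against the original: if two independent passes disagree, at least one of them is wrong.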
Bulletproofing the process
Editors of data investigations must ask even more questions than usual and do their own research. In much the same way that the reporter may have identified similar works of journalism or scientific studies, editors should familiarise themselves with those methodologies as well.
Also, it’s essential for editors to know and understand a database’s ‘record layout’, or more simply put, the kind of information that is contained in the data and how it is broken down and organised. If there is a ‘read me’ file that accompanies the data, which often describes known quirks or problems, it’s the editor’s responsibility as much as the reporter’s to read and understand those details.
You should discuss known or suspected problems in the data with your reporters and the whole project team. In fact, even if your reporters don’t bring them to you, ask what the problems are, because data always has problems. Listen to your reporters carefully, and have regular check-ins with them to see what’s worrying them. Look at the data yourself, or if you’re not conversant in the software, ask the reporter to give you a guided tour of the data. Don’t be shy about challenging the data if there’s anything you don’t understand or that doesn’t pass the sniff test. Encourage creative thinking and brainstorm solutions.
Finally, have your reporters write their methodology (or a ‘white paper’ if it’s a long and complicated analysis) before they start drafting any story. Most often, methodologies (aka the ‘nerd box’) are written at the end of a project drafting process. But it’s not at all uncommon -- once a reporter has to explain all the details of how they conducted the analysis -- that the story language needs to change. Once a detailed methodology is written, it’s even possible that you find some misunderstandings amongst the team over what was done and how. It’s better to surface these issues before the writing begins.
Here are a couple of examples of handling methodology in copy. Aside from including a few paragraphs in your story, which is how simple methodologies can be handled, you can write a separate short story on what you did. You can also produce a more detailed white paper, which allows you to go into great detail on a complex analysis and can have the effect of creating greater confidence and transparency around your work.
To sum up, here are 10 questions every editor should ask:
- Does the data answer our questions? Does it surface other questions?
- Where did you find the data?
- How did you vet and clean the data?
- How did you calculate those numbers?
- Are you keeping a data diary?
- Did you replicate your data work? Could someone else?
- Have you consulted experts or done a scientific literature review?
- Do we need a white paper?
- Could you write a nerd graf/story if asked to?
- What is the significance of the data? (Don’t confuse effort with importance.)
Writing the data story
As we mentioned at the start, the best data stories are not data heavy. They don’t ask readers to ‘do the math’, and they don’t subject the narrative to a lot of numbers. They tell the story or stories that the data has surfaced, through interesting characters or circumstances. As you guide your reporters through the writing phase, consider some of the examples below.
In this Associated Press story, the writers wait until the sixth paragraph to introduce the idea that the topic of the story -- spinal cord stimulators -- is being examined because the devices rank among the top of those causing patient harm. In fact, the sixth and seventh paragraphs help form the nut grafs of the story, which is often where data first appears in stories that take a narrative approach:
“But the stimulators — devices that use electrical currents to block pain signals before they reach the brain — are more dangerous than many patients know, an Associated Press investigation found. They account for the third-highest number of medical device injury reports to the U.S. Food and Drug Administration, with more than 80,000 incidents flagged since 2008.
Patients report that they have been shocked or burned or have suffered spinal-cord nerve damage ranging from muscle weakness to paraplegia, FDA data shows. Among the 4,000 types of devices tracked by the FDA, only metal hip replacements and insulin pumps have logged more injury reports.”
This is how The Philadelphia Inquirer started its story on the findings of a year-long investigation into how children in public schools were suffering from environmental poisoning:
“Day after day last September, toxic lead paint chips fluttered from the ceiling of a first-grade classroom and landed on the desk of 6-year-old Dean Pagan.
Dean didn’t want his desk to look messy. But he feared that if he got up to toss the paint slivers in the trash, he’d get in trouble.
So he put them in his mouth. And swallowed them.”
There’s no indication in the opening paragraphs that these findings are the result of a data analysis until the eighth and ninth paragraphs:
“As part of its “Toxic City” series, the Inquirer and Daily News investigated the physical conditions at district-run schools. Reporters examined five years of internal maintenance logs and building records, and interviewed 120 teachers, nurses, parents, students, and experts.
When the newspapers analyzed the district records, they identified more than 9,000 environmental problems since September 2015. They reveal filthy schools and unsafe conditions — mold, deteriorated asbestos, and acres of flaking and peeling paint likely containing lead — that put children at risk.”
This is not to say that using the findings of a data analysis in your lead is always a bad idea. Each story, including its methodology and findings, needs to dictate the best approach to follow. Consider these examples:
The Post and Courier in South Carolina won a Pulitzer Prize for its 2014 domestic violence investigation, in which the data analysis was the lead because the numbers were so startling:
“More than 300 women were shot, stabbed, strangled, beaten, bludgeoned or burned to death over the past decade by men in South Carolina, dying at a rate of one every 12 days while the state does little to stem the carnage from domestic abuse.”
Compare that with this narrative-based example from ESPN, on what ends up in some US stadium foods:
“Most Cracker Jack boxes come with a surprise inside. At Coors Field in Denver, the molasses-flavored popcorn and peanut snacks came with a live mouse.”
Illustrations, graphics, and videos are often your best friends in presenting data stories, as they can do the heavy lifting of the analysis, allowing your reporter’s storytelling (in any format) to flourish in the findings, not drown in the data.
Some examples below illustrate how data can achieve storytelling goals:
- A collaborative investigation into the death toll in Puerto Rico caused by Hurricane Maria, which was named 2019 investigation of the year by the Data Journalism Awards. The powerful interactive embedded in the story showed how the numbers grew beyond initial reports. Greatly. It also included a searchable database with profiles of the dead.
- A look at how thin models must be to walk the catwalk by NOS, Netherlands, which uses a combination of video and graphics to illustrate the findings of an analysis into the sizes of 1,000+ models.
- Ocean Shock, a Reuters investigation into the effect of the climate crisis on marine life, and a stunning and engaging visual presentation of data.
- ESPN’s 2018 analysis of food-safety inspection reports for professional sports venues; this powerful data presentation allowed writers to deliver up the mouse lead above. The graphics provide lots of numbers without overwhelming the reader.
One last thing...
Earlier, we referenced the possibility of doing an investigation based on data you assemble and analyse yourself.
In many ways, that is the most original form of data investigation because you’re not analysing some other entity’s information but rather doing the ground-up reporting that will give you truly unique findings.
That’s the upside. The downside is that this form of investigative data analysis is extremely labour-intensive and fraught with potential methodology questions and errors. It requires more time and greater levels of bulletproofing from reporters and their editors, so plan accordingly if you decide it’s the only way to answer that burning question.