Where does data come from? What can journalists do with it? And what happens once they're done with it? These questions, and more, are explored by Jonathan Stray in the Tow Center's 2016 publication The Curious Journalist’s Guide to Data.
We asked Jonathan to expand on some of the book's insights to satiate the data curiosities of interested journalists.
Your book is divided into three interconnected aspects of data journalism: quantification, analysis, and communication. Which part is the most challenging for beginners, and what advice do you have for them?
I suspect that the explanations of sampling error and statistical inference will be challenging for many people, because they are the most mathematically detailed. You can find this stuff in any stats textbook, but usually not in a way that's easy for people to read. I've tried to clarify the underlying logic by showing how all we are really doing is counting. And that's my advice for learning mathematics: it's really not about the equations. It's not, in the same way that an essay is not about the letters the writer used. Learn the underlying concepts, the underlying reasoning, because that's where the math comes from. That's what the math was invented to express.
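The idea that inference is "really just counting" can be shown with a quick simulation (an illustrative sketch of my own, not an example from the book): poll a hypothetical population repeatedly and count how much the sample proportion wobbles from poll to poll. That wobble is the sampling error.

```python
import random

random.seed(42)

# Hypothetical population: voters, 30% of whom support a candidate.
population_support = 0.30

def sample_proportion(n):
    """Poll n random voters and count the fraction who are supporters."""
    supporters = sum(1 for _ in range(n) if random.random() < population_support)
    return supporters / n

# Run 1,000 simulated polls at two sample sizes.
small_polls = [sample_proportion(100) for _ in range(1000)]
large_polls = [sample_proportion(10_000) for _ in range(1000)]

def spread(estimates):
    """Standard deviation of the poll results: the sampling error."""
    mean = sum(estimates) / len(estimates)
    return (sum((e - mean) ** 2 for e in estimates) / len(estimates)) ** 0.5

print(f"spread of polls with n=100:    {spread(small_polls):.4f}")
print(f"spread of polls with n=10000:  {spread(large_polls):.4f}")
```

A hundredfold bigger sample shrinks the spread roughly tenfold, which is the square-root law every stats textbook derives with equations, recovered here by nothing fancier than counting.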
Throughout your book, it is clear that to work with data you have to understand how it was quantified. For data newbies, or those less confident with math, how do you suggest approaching this?
Quartz recently published a fantastic guide for working with dirty data -- which is all data -- including a lot of excellent questions to ask about how it was created.
You acknowledge that journalism often focuses on abstract concepts that are hard to quantify. What role does creative thinking play in devising ways to quantify these? Do you have any favourite examples?
One question that many people have asked is why we count economic ‘growth’ in terms of GDP. There have been lots of suggestions that we should count ‘wellbeing’ or ‘happiness’ or ‘development’ or something else instead, and all kinds of quantification strategies such as the UN's Human Development Index (HDI) or the Genuine Progress Indicator (GPI). These are complex attempts to define what we care about. I often wonder about simpler ways to measure growth. What if we just used median income instead? That would make ‘growth’ much more sensitive to income inequality, but we'd still have all the advantages of measuring in terms of money. It's far more practical and interpretable than more complex schemes. But it undoubtedly misses other things. Inventing good quantification methods is really an art, and depends deeply on what we think is important.
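The sensitivity of a median-based growth measure to inequality is easy to see with toy numbers (my own illustrative figures, not from the interview): when all of the income gains accrue at the top, the mean rises while the median stands still.

```python
# Hypothetical incomes for a five-person economy (in $1000s).
year1 = [20, 30, 40, 50, 60]
# All of the 'growth' goes to the top earner.
year2 = [20, 30, 40, 50, 110]

def mean(incomes):
    return sum(incomes) / len(incomes)

def median(incomes):
    incomes = sorted(incomes)
    mid = len(incomes) // 2
    if len(incomes) % 2:
        return incomes[mid]
    return (incomes[mid - 1] + incomes[mid]) / 2

print(f"mean income:   {mean(year1):.0f} -> {mean(year2):.0f}")
print(f"median income: {median(year1):.0f} -> {median(year2):.0f}")
```

A GDP-style average reports growth for this economy; the median reports none, because the typical person is no better off.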
You also discuss the high degree of uncertainty that data journalists work with. Does this uncertainty differ from other types of storytelling? Why, or why not?
There's uncertainty in all reporting, which is why journalists look for multiple sources and so forth. But journalists are not used to quantifying uncertainty. Very often journalists aren't even really aware of the uncertainty in what they're reporting. Political commentary is a great example of this. FiveThirtyEight showed how silly it really is to talk about the outcome of an election without a statistical model. But lots of other areas are equally silly right now. Financial and business reporting comes to mind. Most market trend stories are really built on sand.
Confusing causation and correlation is a common mistake, within and beyond the datasphere. What mistakes do you see data journalists often make, and what can be done to prevent them?
I still see confusion about causation all the time. It's a subtle concept. And we want so badly to say ‘hey, this study says chocolate makes you smarter!’ It's a great headline. This is a problem with uncertainty in general -- certainty usually makes for much more powerful communication. Scientists are much more accustomed to hedging their language, but journalists don't like to do it. This reduces the intelligence of the conversation, just like dumbing down any other fact.
A large theme in your book is the potential for data stories to mislead audiences. What role does the consumer play in combating these kinds of misinformation?
Media literacy is a key skill for a 21st century citizen, and I think it should be taught in schools. But journalists have to communicate with the audience we have, right now, today.
It makes no sense to blame the audience for not being smart or critical enough, or to wish they were better educated. It's our job to communicate clearly anyway.
In your book, you state that for "the reader to walk away with a fair and representative idea of what the data means out in the world, [your] examples should be average. They should be typical. This goes up against journalism's fascination with outliers". A journalist's fascination with outliers probably derives from a desire to tell gripping stories. How else can journalists make average examples compelling?
It comes down to whether the audience already understands the typical case. For example, in the case of crime statistics, it might surprise people to know that violent crime is decreasing. They may find it interesting. If, on the other hand, the audience already understands the true situation, then the outliers may be what tells them something useful. Another way to ask this question is: how does a single story relate to the larger truth? One of the things I'm trying to get people to think about is how their story functions in the context of all the information someone consumes.
To end your book, you confess that "it can take a while to find your voice in data journalism". How did you find your voice, and how can others learn from your experience?
I have various fascinations including the algorithms that run our society, finance, and politics. I continue to do more stories in those areas. Honestly, writing this book, although it's not journalism itself, was part of finding my voice. Some of the best creations come from scratching your own itch.
Read the full book here.