Course introduction
The images used in the course presentations are by: Michael Maggs, Yuval Segal, Tyler Vigen, Minivalley, Pixabay, Gallup, Matthew Ferguson, Dennis Rogers, 20th Century Fox (The Simpsons), Bill Branson, Quinn Dombrowski, Daniel J. McLain, Rudyard Kipling.
Hi. My name is Stijn Debrouwere, and I do analytics for various media companies. I used to work at The Guardian and at Fusion, a TV station in the US. This course is Bulletproof Data Journalism. What we'll be talking about is the fundamental question that reporters always face, which is, "I found a pattern in the data, but can I really be sure that this pattern is more than just a spurious correlation? Is it cause and effect?" To do that, we will talk about basically everything that can go wrong, everything that could stand between a correlation and a cause, and we'll learn about five different types of error. But before we do that, let me talk a little bit about how kids learn about causes and correlations, because that's very instructive.
Kids first learn about how one thing can cause another when they're eight to nine months old. Researchers have done very interesting psychological experiments showing this: if you show a kid a video of a billiard ball hitting another ball and then nothing happening, the ball just stops, a five-month-old or six-month-old kid doesn't care. They think nothing interesting has happened. But show the same video to a kid that's nine months old and they will be very surprised. Because isn't the one ball hitting the other supposed to cause the other ball to move?
If you think about it, how does a kid learn about these things? Well, it just repeatedly sees, in real life, one thing happening and then another thing happening, one after the other. We don't have any supernatural powers for learning how one thing causes another; we just see one thing happening and then another thing happening, which can be very annoying, because we're never quite sure whether what we're seeing is really true.
A great story to illustrate that is one that was once told to me, or that I read about in a book, by Daniel Kahneman, the famous psychologist. He used to teach at Princeton in New Jersey, but he lived in New York, and every week he would drive back and forth between Princeton and New York, home and back to school. On one Sunday drive, he encountered a burning car. It's kind of strange, but I guess it happens. He didn't really think that much about it: all right, it's a burning car, and he moves on, continues to drive. But the next week, he does the exact same drive, and in the exact same spot, he encounters another car burning. Now he gets to wondering, "What's going on here?" That's odd, right?
But it was just a coincidence. And yet, he said, for years afterwards, every time he did that Sunday drive, he would imagine that there would be a burning car. He would always look at that exact same spot and think, "There's going to be a burning car there." Of course, after those first two weeks, there never was a burning car again. That shows how, when we learn about correlations and causes, about how nature works and how society works, it can be very easy to fool ourselves and fool our minds.
And so, we need a framework, a framework we can use to decide whether the decisions we make and the conclusions we draw from the data we see, both in our own journalism and when evaluating claims by others, actually make sense. Lucky for us, such a framework does indeed exist. We really need it, because currently the way journalists deal with the fact that all our conclusions are uncertain is by saying, "Well, I found a pattern, and I know that correlation is not causation, but I'm going to tell you anyway." This is how people deal with it: I can't be sure that what I'm saying is true, but I don't know how else to argue in favor of it or against it, so let's just leave it at that.
And so, "correlation is not causation" has become a meaningless disclaimer, because instead of using it to say, "Hey, you should be careful," people use it as an excuse to then proceed to say anything. Here are some interesting quotes from just a two-minute Google search. First: "Correlation is not causation, but here is more evidence rejecting the view that world globalization means wider gaps between rich and poor countries." Second: "Correlation is not causation, but the plunge began just as the SEC voted in favor of high-speed trading firms being registered with FINRA." Third: "Correlation doesn't imply causation, but these statistics suggest," etc., etc.
"Of course correlation is not causation, but these studies, at the very least, suggest --," again. And so "correlation is not causation" started to mean the opposite of what it was intended. It was intended originally as a warning to people to not take conclusions too seriously and to think really, really hard before drawing any conclusions from data. Nowadays, it means the opposite, and so we need a better way. We need a better way to think about how we can prove that our data and that our data journalism is really up to scratch.
In this course, we will talk about five types of errors, five types of problems. If we can make a reasonable argument that our research, our data journalism, doesn't suffer from those five types of errors, then that's a pretty good standard. We're not going to go into detail now; that's for the next couple of modules. But the first of those five types of errors is coincidences and freak accidents: sometimes stuff just happens, it's random, and there's really no deeper meaning to it.
The second is hidden influences: sometimes we think it's very clear that one thing causes the other, but there can be many other factors that influence A and B and stand between them. The third is cherry-picking: often our data is biased, including only some people and not others, which makes it hard to draw true conclusions. The fourth is flaws in the data: sometimes our data just isn't that good, and if you don't have good data then you can't do good data journalism; garbage in, garbage out.
And then, lastly, a very important one, especially for us in journalism: misinterpretations, coming up with stories about what you think the data means that might not necessarily be grounded in fact. It's these five types of error, these five types of mishap, that we'll be talking about in this course. Hopefully, by the end of the course, you will know how to avoid these types of error and how to spot them in other people's work, for example, when you're reading a scientific paper or something of the sort. Because of that, you will be able to be very confident in your data journalism, confident that the work you put out is of the best possible quality and that people can trust the results, which hopefully you'll agree is a very valuable thing. All right, so let's get started with module one.
Stijn Debrouwere
Data Scientist
Biography
Stijn is a freelance data scientist, specializing in analytics. Previously at Fusion and The Guardian. He writes about statistics, metrics and the news industry at debrouwere.org.