Crowdsourcing Data at the Guardian Datablog
Written by: Marianne Bouchart
Crowdsourcing, according to Wikipedia, is "a distributed problem-solving and production process that involves outsourcing tasks to a network of people, also known as the crowd". The following is from an interview with Simon Rogers on how the Datablog used crowdsourcing to cover the MPs’ expenses scandal, drug use, and the Sarah Palin papers:
Sometimes you will get a ton of files, statistics, or reports that are impossible for one person to go through. You may also get hold of material that is inaccessible or in a bad format, and you aren’t able to do much with it. This is where crowdsourcing can help.
One thing the Guardian has got is lots of readers, lots of pairs of eyes. If there is an interesting project where we need input, then we can ask them to help us. That is what we did with the MPs’ expenses. We had 450,000 documents and very little time to do anything. So what better way than to open up the task to our readership?
The MPs’ expenses project generated lots of tip-offs. We got more stories than data. The project was remarkably successful in terms of traffic. People really liked it.
We are currently doing something with MixMag on drug use, which has been phenomenal as well. It looks like it is going to be bigger than the British Crime Survey in terms of how many people come back to it, which is brilliant.
What both of these projects have in common is that they are about issues that people really care about, so they are willing to spend time on them. A lot of the crowdsourcing we have done relies on help from obsessives. With the MPs’ expenses we had a massive amount of traffic at the beginning and then it really died down. But we still have people who are obsessively going through every page looking for anomalies and stories. One person has done 30,000 pages. They know a lot of stuff.
We also used crowdsourcing with the Sarah Palin papers. Again this was a great help in scouring the raw information for stories.
In terms of generating stories, crowdsourcing has worked really well for us. People really liked it and it made the Guardian ‘look good’. But in terms of generating data, we haven’t used crowdsourcing so much.
Some of the crowdsourcing projects that we’ve done that have worked really well have been more like old-fashioned surveys. When you are asking people about their experience, about their lives, about what they’ve done, they work very well because people aren’t as likely to make that up. They will say what they feel. When you ask people to kind of do your job for you, you have to find a framework for them to produce the data in a way you can trust.
Regarding the reliability of data, I think the approach that Old Weather have got is really good. They get ten people to do each entry, which is a good way to ensure accuracy. With the MPs’ expenses, we tried to minimise the risk of MPs going online and editing their own records to make themselves look better. But you can’t permanently guard against this. You can only really look out for certain URLs, or check whether it’s coming from the SW1 area of London. So that’s a bit trickier. The data we were getting out was not always reliable. Even though the stories were great, it wasn’t producing raw numbers that we could confidently use.
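The idea of having several people do the same entry can be illustrated with a short sketch. The following is only an illustration under stated assumptions: the function name `consensus`, the 60% agreement threshold, and the sample values are hypothetical and do not describe how Old Weather or the Guardian actually combine contributions.

```python
from collections import Counter


def consensus(transcriptions, min_agreement=0.6):
    """Return the most common answer if enough contributors agree, else None.

    `min_agreement` is an assumed threshold chosen for illustration.
    """
    # Normalise whitespace and case so trivial differences do not split the vote.
    cleaned = [t.strip().lower() for t in transcriptions if t and t.strip()]
    if not cleaned:
        return None
    value, votes = Counter(cleaned).most_common(1)[0]
    # Accept the value only when its share of the votes reaches the threshold.
    if votes / len(cleaned) >= min_agreement:
        return value
    return None


# Ten hypothetical contributors transcribing the same made-up figure.
entries = ["120.50"] * 8 + ["126.50", "120.60"]
print(consensus(entries))
```

An entry where contributors disagree too much returns `None`, which could be used to flag it for manual checking rather than accepting a doubtful number.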
If I were to give advice to aspiring data journalists who want to use crowdsourcing to collect data, I would encourage them to do this on something that people really care about, and will continue to care about when it stops making front-page headlines. Also, if you make something more like a game, this can really help to engage people. When we did the expenses story a second time it was much more like a game, with individual tasks for people to do. It really helped to give people specific tasks. That made a big difference, because I think if you just present people with a mountain of information and say ‘go through this’, it can make for hard and rather unrewarding work. So I think making it fun is really important.