Data scraping for stories
Conversations with Data: #12
We’ve all been there. Searching, searching, and searching to no avail for that one dataset that you want to investigate. In these situations, the keen data journalist might have to take matters into their own hands and get scraping.
For those of you new to the area, data scraping is a process that allows you to extract content from a webpage using a specialised tool or by writing a piece of code. While it can be great if you’ve found the data you want online, scraping isn’t without challenges. Badly formatted HTML with little or no structural information, authentication systems that prevent automated access, and changes to a site’s markup are just some of the things that limit what can be scraped.
But that doesn’t mean it’s not worth a go! This edition of Conversations with Data brings together tips from scraping veterans Paul Bradshaw, Peter Aldhous, Mikołaj Mierzejewski, Maggie Lee, Gianna-Carina Grün and Erika Panuccio, to show you how it’s done.
Double check your code so you don’t miss any data
Peter Aldhous -- science reporter, BuzzFeed News
"I fairly regularly scrape the data I need using some fairly simple scripts to iterate across pages within a website and grab the elements I need.
- Why Are Dope-Addicted, Disgraced Doctors Running Our Drug Trials? (Here I scraped records of disciplinary actions against doctors in New York state from an earlier version of this site.)
- The Inside Track (Here I scraped authors, and other metadata, including scientific discipline and citation counts, from 10 years of papers published by the Proceedings of the National Academy of Sciences, more details here.)
- Why Track-And-Field Stars Don’t Set World Records Like They Used To (But Swimmers Do) (Here I scraped data on the 100 all-time top outdoor performances in many track-and-field events from the International Association of Athletics Federations website.)
- 'I Have The Best Words.' Here's How Trump’s First SOTU Compares To All The Others. (Here I scraped the full text of every State of the Union and other Presidential addresses to both houses of Congress from The American Presidency Project website; details of the analysis here.)
Advice: You need to pay a lot of attention to checking that you're getting all of the data. Subtle variations or glitches in the way websites are coded can throw you, and may mean that some gaps need to be filled manually. Use your browser's web inspector and carefully study the pages' source code to work out how the scraper needs to be written. SelectorGadget is a useful Chrome browser extension that can highlight the CSS selectors needed to grab certain elements from a page."
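To make the advice above concrete, here is a minimal sketch in Python of iterating across pages, grabbing the elements you need, and then checking that nothing was missed. It is not the code behind any of the stories above. The sample pages, the `record` class name and the expected total are all made up for illustration, and it uses only the standard library rather than a CSS-selector library.

```python
from html.parser import HTMLParser


class RecordParser(HTMLParser):
    """Collect the cell text of every <tr> whose class list includes "record"."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._row = None      # cells of the row currently being read
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "tr" and "record" in classes:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


def scrape_pages(pages):
    """Iterate over page sources and return every record row found."""
    all_rows = []
    for html in pages:
        parser = RecordParser()
        parser.feed(html)
        parser.close()
        all_rows.extend(parser.rows)
    return all_rows


# Made-up pages standing in for a paginated site.
SAMPLE_PAGES = [
    '<table><tr class="record"><td>Dr A</td><td>2010</td></tr>'
    '<tr class="record"><td>Dr B</td><td>2011</td></tr></table>',
    # A subtle variation: an extra class and stray whitespace on page two.
    '<table><tr class="record odd"><td> Dr C </td><td>2012</td></tr></table>',
]
EXPECTED_TOTAL = 3  # assumed: the number of records the site itself reports

rows = scrape_pages(SAMPLE_PAGES)
missing = EXPECTED_TOTAL - len(rows)
print(f"Scraped {len(rows)} rows, {missing} missing")
```

Comparing the scraped count against a total reported by the site is one cheap way to spot gaps. If the numbers differ, that is the cue to go back to the web inspector and look for the pages where the markup varies.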
Good quality scraping takes time, so communicate effectively and look for existing APIs
Mikołaj Mierzejewski -- data journalist at Gazeta Wyborcza, Poland’s biggest newspaper
"I think there are three different situations which everyone encounters in scraping data from websites:
- The first is the easiest: the data is in plain HTML and you can use browser tools like Portia to scrape it.
- The second is trickier: the data needs cookies or a preserved session, or it is loaded but takes some tinkering with the browser's developer tools to download.
- The third is where you basically need a programmer on board: the data is loaded dynamically as you interact with the site. They will develop a small application that acts as a browser to download the data. Most sites will allow downloading at a rate of about one request every 0.75 seconds, but if you want to download loads of data, you will again need a programmer to develop a more efficient scraper.
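As a rough illustration of pacing requests, the sketch below pauses between downloads. The 0.75 second default simply echoes the rule of thumb above and is no guarantee for any particular site. The `fetch_all` function, the example URLs and the stand-in `fetch` are hypothetical; in real use you would pass in a function that actually downloads the page.

```python
import time


def fetch_all(urls, fetch, delay=0.75, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # wait before every request except the first
        results[url] = fetch(url)
    return results


# Demo with a stand-in fetcher and a recorded (not real) pause, so it runs instantly.
pauses = []
pages = fetch_all(
    ["https://example.org/page/1", "https://example.org/page/2", "https://example.org/page/3"],
    fetch=lambda url: f"<html>{url}</html>",
    sleep=pauses.append,
)
print(len(pages), "pages fetched with pauses", pauses)
```

Passing `fetch` and `sleep` in as arguments keeps the pacing logic separate from the downloading, which makes it easy to test without touching anyone's server.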
One of the hardest parts of scraping is communicating your work to your non-technical colleagues, especially non-technical managers. They need to know that good quality scraped data takes time because, as you develop scrapers, you learn the inner workings of someone's web service, and believe me, it can be a mess inside.
If you're curious about how we recently used data scraping, here are links to articles showing data scraped from Instagram - we took posts that had the '#wakacje' ('#vacation' in Polish) hashtag and put their geolocations on a map to see where Polish people spend their vacations. The articles are in Polish, but the images are fascinating:
I'd also add one more thing regarding data scraping -- always look for APIs first, before getting your hands dirty with scraping. APIs may have request limits but using them will save you a lot of time, especially if you're in a prototyping phase. Postman and Insomnia are good tools for playing with APIs."
9 things to remember about scraping
Paul Bradshaw -- Course Director of the MA in Data Journalism at Birmingham City University and author of Scraping for Journalists
"Some thoughts about stories I've worked on that involved scraping...
- Think about T&Cs: we wanted data from a property website but the T&Cs prohibited scraping - we approached them for the same data and in the end, they just agreed to allow us to scrape it ourselves. Needless to say, there will be times when a public interest argument outweighs the T&Cs too, so consult your organisation's legal side if you come up against it.
- Use scraping as a second source: for this investigation into the scale of library cuts we used FOI requests to get information -- but we also used scraping to go through over 150 PDF reports to gather complementary data. It meant that we could compare the FOI requests to similar data supplied to an auditor.
- If it has a pattern or structure it's probably scrapable: as part of a series of stories on rape by the Bureau of Investigative Journalism, we scraped reports for every police force. Each report used the same format and so it was possible to use a scraper to extract key numbers from each one.
- Check the data isn't available without having to scrape it: the petitions website used for this story, for example, provides data as a JSON download, and in other cases, you may be able to identify the data being loaded from elsewhere by using Chrome's Inspector (as explained here, for example).
- Do a random quality check: pick a random sample of data collected by the scraper and check them against the sources to make sure it's working properly.
- Use sorting and pivot tables to surface unusual results: when scrapers make mistakes, they do so systematically, so you can usually find the mistakes by sorting each column of resulting data to find outliers. A pivot table will also show a summary which can help you do the same.
- Scrape as much as possible first, then filter and clean later: scraping and cleaning are two separate processes and it's often easier to have access to the full, 'dirty' data from your scraper and then clean it in a second stage, rather than cleaning 'at source' while you scrape and potentially cleaning out information that may have been useful.
- Scrape more than once -- and identify information that is being removed or added: this investigation into Olympic torchbearers started with a scrape of over 5,000 stories - but once the first stories went live we noticed names being removed from the website, which led to further stories about attempts to cover up details we'd reported. More interestingly, I noticed details that were added for one day and then removed. Searching for more details on the names involved threw up leads that I wouldn't have otherwise spotted."
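Two of the points above, the random quality check and scraping more than once, lend themselves to a short sketch. This is an illustration under assumptions rather than the code used in any of these investigations: the function names, the `seed` parameter and the sample names are invented for the example.

```python
import random


def spot_check_sample(rows, k=5, seed=None):
    """Pick up to k scraped rows at random to compare by hand against the source."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def diff_scrapes(earlier, later):
    """Return (added, removed) identifiers between an earlier and a later scrape."""
    earlier, later = set(earlier), set(later)
    return sorted(later - earlier), sorted(earlier - later)


# Made-up rows from a single scrape.
scraped_rows = [{"name": f"Person {n}"} for n in "ABCDE"]
sample = spot_check_sample(scraped_rows, k=2, seed=1)

# Made-up names from two scrapes of the same page on different days.
monday = ["Person A", "Person B", "Person C"]
tuesday = ["Person A", "Person C", "Person D"]
added, removed = diff_scrapes(monday, tuesday)
print("Check by hand:", sample)
print("Added:", added, "Removed:", removed)
```

Anything that turns up in `removed` or `added` is not a finding in itself, only a prompt to ask why the page changed.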
Use scrapers as a monitoring tool
Maggie Lee -- freelance state and local government reporter in Atlanta
"I don't have a 'big' story, but I'd put in an endorsement for scrapers as monitors, as a thing to save beat reporters' time.
For example, I wrote a scraper for a newspaper that checks their county jail booking page every half-hour. That scraper emails the newsroom when there's a new booking for a big crime like murder. Or the reporters can set it to look for specific names. If "Maggie Lee" is a suspect on the run, they can tell the monitor to send an email if "Maggie Lee" is booked. It just saves them the time of checking the jail site day and night. The newsroom uses it every day in beat reporting.
For another example, I have a scraper that checks for the city of Atlanta audits that get posted online. It emails me when there's a new audit. Not every audit is worth a story, but as a city hall reporter, I need to read every audit anyway. So, with this scraper, I don't have to remember to check the city auditor's site every week."
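The monitoring idea can be sketched roughly as below. This is not Maggie's scraper: the booking records, their field names and the email address are invented, the step that scrapes the page is left out, and the email is only built, not sent.

```python
from email.message import EmailMessage


def new_alerts(bookings, seen_ids, watch_terms):
    """Return bookings not seen before whose name or charge matches a watch term."""
    alerts = []
    for booking in bookings:
        if booking["id"] in seen_ids:
            continue  # already handled on an earlier run
        seen_ids.add(booking["id"])
        text = f'{booking["name"]} {booking["charge"]}'.lower()
        if any(term.lower() in text for term in watch_terms):
            alerts.append(booking)
    return alerts


def build_email(alerts, to_addr):
    """Build (but do not send) a notification email listing the alerts."""
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["Subject"] = f"{len(alerts)} new jail booking(s) to check"
    msg.set_content("\n".join(f'{b["name"]}: {b["charge"]}' for b in alerts))
    return msg


# Made-up data standing in for what a scrape of the booking page might return.
seen = set()
page = [
    {"id": "B-101", "name": "Maggie Lee", "charge": "Trespass"},
    {"id": "B-102", "name": "Person A", "charge": "Murder"},
    {"id": "B-103", "name": "Person B", "charge": "Shoplifting"},
]
watch = ["murder", "Maggie Lee"]
first_run = new_alerts(page, seen, watch)
second_run = new_alerts(page, seen, watch)  # same page again: nothing new
message = build_email(first_run, "newsroom@example.org")
print(message["Subject"])
```

In practice the set of seen IDs would need to be saved between runs, the script would run on a schedule such as every half-hour, and the message would be handed to something like `smtplib.SMTP(...).send_message(message)`.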
Make sure your scraper is resilient, and always have a backup plan
Gianna-Carina Grün -- head of data journalism at DW
"These two stories of ours relied on scraping:
Here we scraped the country pages of the EU Trust Fund for Africa for information on projects in these countries.
If you have a scraper running regularly, you have to design it in a way that small changes by the data providers in wording on the page or within the data itself will not break your code. When you write your scraper, you should try to make it as resilient as possible.
The scraper code can be found here.
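One way to think about resilience, sketched here under assumptions and separate from DW's own scraper, is to match field labels loosely and to fail loudly when an expected field disappears. The field names, aliases and sample values below are all hypothetical.

```python
import re


def normalise(label):
    """Lower-case a label and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


# Hypothetical alternative wordings a data provider might switch between.
FIELD_ALIASES = {
    "budget": {"budget", "total budget"},
    "country": {"country", "countries"},
}


def extract_fields(raw, required=("budget", "country")):
    """Map scraped label/value pairs onto known fields; raise if any are missing."""
    found = {}
    for label, value in raw.items():
        key = normalise(label)
        for field, aliases in FIELD_ALIASES.items():
            if key in aliases:
                found[field] = value.strip()
    missing = [field for field in required if field not in found]
    if missing:
        raise ValueError(f"Page layout may have changed; missing fields: {missing}")
    return found


# Made-up scraped values with slightly different wording and spacing than expected.
project = extract_fields({"Total Budget:": " 5 000 000 ", "COUNTRY": "Country X"})
print(project)
```

Raising an error when a required field is missing means a changed page stops the run instead of quietly filling your dataset with blanks.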
Here we scraped multiple sources:
1. to get player names and club affiliations out of the PDFs provided by FIFA
2. to get information on which league each club played in
3. to get information on which players played during the World Cup
4. to get information on how each team scored during the World Cup
When we first did the test run with the World Cup 2014 data, FIFA provided information on 1, 3 and 4 - and we hoped that we'd get the data in the same formats. Other data teams tried to figure out in advance with FIFA what the 2018 data format would look like (which is a useful thing to try). We planned for the worst case, that FIFA would not provide the data in the needed format, and relied on our 'backup plan' of other data sources we could get the same information from.
Code for all scrapers can be found here."
Don’t forget to save scraped data
Erika Panuccio -- communications assistant at ALTIS
"I faced issues related to data scraping when I was working on my Master's degree thesis. I had to collect data about pro-ISIS users on Twitter and I ended up with a database of about 30,000 tweets from about 100 users. I chose the accounts I wanted to analyse and then used the automation platform IFTTT to save the tweets whenever they were published, storing them in spreadsheet format. In this way, I could keep the data even if the account was suspended (which happened very frequently because of Twitter's policy on terrorist propaganda) or if the owner deleted the tweets."
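If you are collecting with your own script, and not through a service like IFTTT, the same principle of saving everything as it arrives might look like the sketch below. It is an assumption-laden illustration, not Erika's setup: the column names, the file location and the sample items are invented, and the step that fetches the posts is left out.

```python
import csv
import os
import tempfile
from datetime import datetime, timezone

FIELDS = ["id", "user", "text", "saved_at"]


def save_new_items(path, items):
    """Append items not already in the CSV at `path`; return how many were written."""
    seen = set()
    file_exists = os.path.exists(path)
    if file_exists:
        with open(path, newline="", encoding="utf-8") as f:
            seen = {row["id"] for row in csv.DictReader(f)}
    written = 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if not file_exists:
            writer.writeheader()  # new file: start with the column names
        for item in items:
            if str(item["id"]) in seen:
                continue  # already saved on an earlier run
            saved_at = datetime.now(timezone.utc).isoformat()
            writer.writerow({**item, "saved_at": saved_at})
            seen.add(str(item["id"]))
            written += 1
    return written


# Demo with made-up items, written to a temporary folder.
demo_path = os.path.join(tempfile.mkdtemp(), "saved_posts.csv")
first = save_new_items(demo_path, [{"id": 1, "user": "account_a", "text": "first post"}])
second = save_new_items(demo_path, [
    {"id": 1, "user": "account_a", "text": "first post"},   # already saved
    {"id": 2, "user": "account_b", "text": "second post"},
])
print(first, second)
```

Because rows are only ever appended, a post that is later deleted at the source stays in your file along with the time you saved it.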
Our next conversation
AMA with Jeff Larson, The Markup
With a plan to launch in early 2019 as a nonprofit newsroom, The Markup will produce data-centred journalism on technologies, and how changing uses affect individuals and society.
Until next time,
Madolyn from the EJC Data team