Conversations with Data: #17
Do you want to receive Conversations with Data? Subscribe
It’s been a while since we last spoke -- but we haven’t forgotten you. Over the past few weeks, we’ve spent some time reflecting on our first year of Conversations with Data, which saw us feature advice from 75 individual community members and 17 expert AMAs. It’s been great talking with you, and we want to keep our conversations relevant. Be sure to let us know what you’d like us to cover, or who you’d like us to chat to.
As we gear up to reproduce another great year, we thought it would be fitting to look at journalism that does just that. So, for this conversation, we got talking with Jeremy Singer-Vine, Stephen Stirling, Timo Grossenbacher, and more, about how to set your stories up for repeatability.
What you said
1. Documentation is everything
If there was ever a common theme across your responses, it was this: document, document, document.
In the words of BuzzFeed’s Jeremy Singer-Vine, “reproducibility is about more than just code and software. It's also fundamentally about documentation and explanation. An illegible script that ‘reproduces’ your findings isn't much use to you or anyone else; it should be clear -- from a written methodology, code comments, or another context -- what your code does, and why”.
To keep good documentation, Stephen Stirling from NJ Advance Media suggested reminding yourself that others need to understand your process.
“Keeping regular documentation of steps, problems, etc. not only keeps your collaborators in the loop, but provides instruction for others hoping to do the same work. Documentation on the fly isn’t always going to be pristine, so make a point to go back and tidy it up before you make available to others.”
2. Take time to develop good documentation practices
While we’re agreed that documenting is crucial, what does good documentation look like in practice? As our Handbook chapter author, Sam Leon, explained there are many ways to make data reporting reproducible.
“Often just clearly explaining the methodology with an accompanying Excel spreadsheet can enable others to follow precisely the steps you took in composing your analysis. In my chapter in the Handbook, I discuss the benefits of using so-called ‘literate programming’ environments like Jupyter Notebooks for reproducibility,” he said.
There’s no set workflow, and often it may be a matter of tailoring your approach to your project’s needs. Brian Bowling shared some thoughts from his experience:
“A change log would be a good idea for a longer project, particularly when other people will be using or viewing the workbook. When I mainly worked in Excel, I didn't do that. As I branched out into R, Python and MySQL, I started keeping Word documents that included data importing and cleanings steps, changes, major queries used to extract information and to-do lists. Since I wasn't working exclusively in one piece of software, keeping separate documentation seemed like a better idea. It also makes it easier to pull the documentation up on one screen while working on another.”
For those of you working in Excel, have a go at using Jeff South’s solution for keeping track of his workflows:
“Every time I make a significant modification to data, I copy it to a new worksheet within the same file, and I give the new worksheet a logical name ("clean up headers" or "calculate crime rates"). Also, I insert comments on columns (and sometimes on rows) to document something I've done, like "Sorted by crime rates and filtered out null values").
This system means that all changes are clearly documented in one file and you don’t have multiple files with similar names.
3. Consider your software
Timo Grossenbacher, SRF Data, warned: “Scripts may fail to execute only months after initial creation -- even worse, they suddenly produce (slightly) different results without somebody noticing. This is due to rapid development of software (updated packages, updated R / Python version, etc.). Make sure to at least note down the exact software versions you used during initial script creation.”
If you’re using R, Timo’s created an automated solution for version control, which Gerald Gartner said has helped Addendum avoid problems with different versions of packages. They’ve moved all of their analysis to R and now, he said, “the 'pre-R-phase' feels like the dark ages.”
At BuzzFeed, they also use R, along with Python. Jeremy Singer-Vine suggested going for these types of ‘scriptable’ software rather than point-and-click tools. That said, “if you're wed to a point-and-click tool, learn about its reproducibility features, such as OpenRefine's 'Perform Operations'".
4. Make use of data packages
Serah Njambi Rono, from Open Knowledge International, reminded us that datasets are constantly evolving -- and this can limit successful reproducibility.
“Outdated versions of datasets are not always made available alongside the updated datasets - a common (and unfortunate) practice involves overwriting old information with new information, making it harder to fact check stories based on older datasets after a period of time.
Frictionless Data advocates for the use of data packages in answer to the latter issue. Data packages allow you to collate related datasets and their metadata in one place before sharing them. This means that you can package different versions of the same dataset, provide context for them, publish them alongside your reporting and update them as needed, while allowing for repeatability.”