As journalists, we are committed to being the watchdogs who call government and companies to account. A new field of accountability is the use of algorithms. These are employed in multiple facets of our lives -- often without our knowledge -- to determine prison sentencing, hiring, whether someone should be granted a loan, and other societal decisions.
As computational methods also become more prevalent in the newsroom, such as data journalism, curated news feeds, automated writing, social media analytics, and news recommender systems, we must hold ourselves to the same standards of accountability and transparency.
To begin this process, I presented a paper at the Computation+Journalism 2016 conference that examines how to develop standards and expectations for transparency.
Benefits of editorial transparency
Releasing documented code and data does sound like extra work, and it is. But by investing this extra time, we benefit ourselves, our field, and our readers.
Writing code that others will see, and potentially criticise or question, encourages us to write cleaner code, with appropriate commenting, logical organisation, and visualisations.
Adopting this process ensures that what we are reporting is based on evidence, and that this evidence is present and correct. And when our code and data are open, as in most open source fields, development within the field can accelerate.
Journalism as a field also benefits educationally: journalists can learn from clear, well-documented code. They can also use the data to create something new, whether that is an angle the original team had not considered or had no time for, or a local story.
Editorial transparency also benefits our readers. Particularly in these times, establishing and maintaining trust with readers is paramount to the continued success of journalism. Part of building that trust is enabling readers to check your work or see the steps that led you to your story, facts, and conclusions.
Just as we cite our sources, we should also provide evidence for our data journalism. By providing the code and the data, we allow readers to engage more deeply with our work, and we open the door to new stories, or even corrections if errors are found in our code.
Below are two case studies from my own work, both using transparency tools that are free and open source.
Uber: A data journalism project investigating uberX wait times across demographics in Washington, DC.
To open up our process:
- data was shared via a lab-account Google Drive (Google Drive is useful when a data set is too large to upload to GitHub)
- cleaned and processed data was shared as a .csv file within the GitHub repository, enabling others to pick up the analysis at later stages if desired
- data analysis code was shared in commented Jupyter notebooks and Python scripts in the GitHub repository
- project and code documentation, the data dictionary, and other experimental particulars were described in the README
- everything that could be achieved programmatically was done programmatically, rather than manually, to enable replicability and facilitate reproducibility
- the Google Drive folder and the GitHub repository were linked in the news article
- the news article was linked in the GitHub repository.
In doing this we:
- were accountable, so that others could inspect our code, data, and assumptions -- we were even notified of a bug in our code via the ‘Issues’ feature on GitHub
- facilitated several independent policy studies in other states and cities based on our code
- enabled others to conduct novel studies and data visualisations, including this one by Kate Rabinowitz.
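To illustrate how a reader might pick up data shared this way and run their own analysis, here is a minimal sketch. The column names (`census_tract`, `expected_wait`) and the helper function are illustrative only, not the project’s actual schema or code:

```python
from statistics import mean

def mean_wait_by_tract(rows):
    """Group wait-time records by census tract and average them."""
    waits = {}
    for row in rows:
        waits.setdefault(row["census_tract"], []).append(float(row["expected_wait"]))
    return {tract: mean(values) for tract, values in waits.items()}

# Illustrative records standing in for rows of a shared .csv file
sample = [
    {"census_tract": "001", "expected_wait": "180"},
    {"census_tract": "001", "expected_wait": "240"},
    {"census_tract": "002", "expected_wait": "420"},
]
print(mean_wait_by_tract(sample))  # {'001': 210.0, '002': 420.0}
```

Because the repository shared the processed .csv alongside the analysis notebooks, a reader could substitute the real file for the sample records above and verify, extend, or localise the analysis.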
Here’s how we ran our second project, a comment bot, to promote transparency:
- code is available on GitHub
- usage and customisation instructions and documentation are available in the README.md file
- the tool is platform independent (Mac, Windows, Linux) -- all that is necessary to create one’s own bot is to install the required Python libraries, edit the configuration file, and run it on a server, such as an AWS instance.
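A minimal sketch of the edit-the-configuration-file step, assuming a hypothetical settings file with illustrative section and key names (the real bot’s configuration may differ):

```python
import configparser

# Illustrative configuration; the real bot's settings file and keys may differ.
sample_ini = """
[bot]
keyword = transparency
post_interval_minutes = 60
"""

config = configparser.ConfigParser()
config.read_string(sample_ini)

keyword = config["bot"]["keyword"]
interval = config["bot"].getint("post_interval_minutes")
print(keyword, interval)  # transparency 60
```

Keeping user-editable settings in a plain configuration file like this means a newsroom can customise the bot without touching the code itself.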
We aimed to make our tool as accessible as possible, so that any newsroom or individual could copy the code and customise the configuration settings to create their own bot. It was challenging to decide how much customisation to build into the tool without making it so flexible that it became too complicated, or diluting the bot’s specific purpose.
Other news organisations’ GitHub examples:
- BuzzFeed News often shares its data analysis, libraries, and tools on its GitHub account, including those behind its recent story about spy planes
- ProPublica shares many tools and stories, including one on machine bias
These are just a few examples of newsrooms sharing their work, although not all of them come with documentation or links to the articles they supported. Remember: simply sharing code is not equivalent to transparency.
What does documentation entail?
- Comment your code, explaining what each line, code block, or function does.
- Writing code in Jupyter Notebooks can be helpful if the project is in Python, R, or Julia, as HTML and Markdown text can be added in between code blocks to provide context or explanation. Graphics can also be displayed inline with the code, and it can be viewed online without requiring you to install specific software.
- Write a README.md for your GitHub repository that provides context for the study, links to the article and to the data, a list of code dependencies (that is, which code libraries were used), a data dictionary, and any other information that may assist someone in following the code. Consider linking to these within the news article.
- Try linking to reference material for any APIs or external software used.
This saves you from rewriting instructions on how to use these APIs. For example, the comment bot collects comments from Disqus forums and, rather than explaining how to set up a Disqus API account, I referred the reader to the Disqus documentation.
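As an example of commenting in the spirit described above, here is a small data-cleaning helper with a docstring and inline comments. The function name, field names, and threshold are illustrative, not code from either project:

```python
def clean_wait_times(records, max_seconds=3600):
    """Drop records with missing or implausible wait times.

    The field name and the plausibility threshold are illustrative,
    not the actual cleaning rules of any project described here.
    """
    cleaned = []
    for record in records:
        wait = record.get("wait_seconds")
        # Skip rows where the scraper returned no value
        if wait is None:
            continue
        wait = float(wait)
        # Keep only values within the plausibility window
        if 0 <= wait <= max_seconds:
            cleaned.append({**record, "wait_seconds": wait})
    return cleaned

print(clean_wait_times([{"wait_seconds": "300"}, {"wait_seconds": None}]))
```

A reader who has never seen the codebase can follow the docstring and comments alone, which is exactly what documentation for transparency should achieve.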
Considerations when sharing
Each project will have its own unique considerations. Sometimes sharing the data in its raw form will not be possible because of privacy issues. In these cases, it might be possible to share aggregated or cleaned data that no longer contains personally identifiable information. In other cases, the data may be proprietary or may have been provided to you with restrictions or under an agreement. Sharing this data will of course not be possible, and a statement could be made to that effect. If the data itself cannot be shared, still consider sharing the code used to clean and analyse it. If any graphics in the article were created using code, share that too.
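One way to share data without exposing personal information is to drop identifying columns and publish only group-level aggregates. A minimal sketch, with illustrative field names standing in for PII:

```python
from collections import Counter

# Illustrative raw records; "name" and "email" stand in for PII fields.
raw = [
    {"name": "A. Reader", "email": "a@example.com", "neighbourhood": "Ward 1"},
    {"name": "B. Reader", "email": "b@example.com", "neighbourhood": "Ward 1"},
    {"name": "C. Reader", "email": "c@example.com", "neighbourhood": "Ward 2"},
]

# Publish only counts per neighbourhood; the PII columns never leave.
counts = Counter(row["neighbourhood"] for row in raw)
print(dict(counts))  # {'Ward 1': 2, 'Ward 2': 1}
```

The aggregated output can be committed to the public repository while the raw records stay private, keeping the analysis checkable without compromising the people in the data.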
Sharing code is great, but for others to use your code, it has to come with a licence. There are many different ones to choose from depending on whether you’re sharing code, data, or a mixture of the two.
Check out these links to help you choose:
For more detail on building transparency and accountability in computational journalism, read the full research paper here.