Name: DataJournalism.com

At its core, the essential work of journalism is to gather and verify non-public information, evaluate its potential value to the public, and then -- if that value is substantial enough -- organise and publish it in such a way that it helps people make informed decisions about their lives. In a very important sense, this means that reporting and publishing around leaked data is no different than any other reporting: Once verified, the question of what to publish, and how, is driven principally by how it can best serve the public good.

Yet both the scale and detail of today's information leaks, especially when combined with a highly networked -- and therefore global -- publishing environment, mean that modern leaks present substantive practical and ethical considerations for journalists. In addition to protecting the source of the leaked information, journalists -- like everyone else -- must work in a digital environment where their activities are almost constantly and ubiquitously tracked, making it all too easy to inadvertently reveal the direction of an ongoing investigation. Moreover, because leaks are now often larger than any one journalist -- or journalistic organisation -- can typically handle, they present unique collaboration and publication challenges, all of which must be carefully engineered to balance efficacy, transparency, and privacy.

Despite these complexities, there is a range of useful methods and heuristics that can help journalists make ethical decisions about using secret, sensitive, or personal data. In fact, the most important -- and sometimes the most significant -- challenge is for journalists to accurately recognise when an ethical situation exists, especially in the fast-paced world of online publishing. For example, both our research and our professional experience indicate that when journalists accurately perceive the sensitivity of the information they have obtained, they handle it with more caution and care. Yet our work also indicates that journalists typically rely on their sources' judgment when assessing the sensitivity of the information they are being given. This suggests that while sources can, in general, trust journalists to handle leaked data carefully, it may also lead to problematic oversights when journalists report on data that has been simply ‘dumped’ online, especially since the sensitivity and/or significance of that data may not be immediately obvious.

Wikileaks has faced criticism for its data dumps, which have at times left sensitive and personal data exposed.

In our view, the mechanism through which data is obtained does not change the ethical standards with which it should be treated. If anything, sensitive information made accessible by a hack or a leak deserves more careful handling, since the agendas of those who have made the information public are uncertain at best. Moreover, the smash-and-grab data collection methods that are typical of hacks and breaches virtually ensure that the information of many private individuals -- whose only failing, often, is having been in a database along with someone or something of interest -- will be swept up along with anything potentially newsworthy. As such, journalists need to take particular care that their work does not implicate or injure those who are only ‘guilty by association’. It is because of this, in part, that ‘leaked’ data may in fact demand more thoughtful handling than material provided by a confidential source.

Hacked, breached, or leaked: key considerations when reporting with private or personal data

As we noted above, journalists can typically be relied upon to think carefully about how they handle sensitive data provided by a trusted or confidential source, in part because a human source will often impress upon the journalist the risks of mishandling the information. The journalist's desire to preserve valuable source relationships (and her own reputation) may also help counterbalance the impulse to publish material that may be more salacious than newsworthy. When data is simply ‘dumped’ online with little or no context, however, this source-oriented conscientiousness may go out the window. Combine this with the pressure journalists may feel from editors or competitors to get a story out quickly, and it can be hard to justify the delay required to weigh how significant the story in a leaked data set actually is.

To help illustrate how the lack of a ‘human’ source -- especially when coupled with the exciting and even illicit nature of leaked data -- can raise crucial ethical challenges, we'll look at the issues presented by the data made public through two prominent hacks: the Sony email hack and the Ashley Madison hack.

The Sony email hack: Hollywood gossip vs. equity reporting

In late November 2014, a massive dump of internal data from Sony Pictures was posted on the data-sharing site Pastebin, following a months-long hack of the entertainment company's systems. The hacked data contained everything from email exchanges to personnel files, and the question of what motivated the attack -- as well as who had perpetrated it -- quickly dominated headlines across the news spectrum. Yet many of the articles examining the contents of the leaked data also leaned toward the tabloid, focusing on executives' nasty exchanges and celebrity name-calling.

Although many may find it hard to muster sympathy for these powerful and high-profile individuals, the coverage's focus on the machinations of Hollywood dealmaking and franchise evolutions also threatened to overshadow many of the substantive issues revealed by the documents, such as industry-wide coordination on lobbying efforts designed to reshape the way that online content is served, or the fact that the data revealed the social security numbers, home addresses, and salaries of tens of thousands of employees.

The emphasis in the total coverage of the Sony hack, moreover, stands in stark contrast to coverage of similarly controversial leaks like the Snowden documents, where the focus has remained on the actions of powerful companies and nation-states, rather than on the foibles and gaffes of the individuals involved.

The Ashley Madison hack: pitfalls of the moral high ground

Less than a year after the Sony Pictures hack, another high-profile hack and data dump made particularly intimate details of a large number of people's lives essentially public online. In this case, however, the individuals whose information was posted were not celebrities or Hollywood dealmakers, but people from all walks of life who had joined an online dating site purportedly designed to facilitate extramarital affairs.

Although the Ashley Madison hack generally contained less personally identifiable information than the Sony Pictures hack, the ramifications of the breach for those affected were sometimes devastating: multiple suicides were attributed to the hack, with widespread blackmailing campaigns, lost political careers, marriages, and community relationships resulting from the fallout. While there was some substantive reporting to be done on aspects of the leaked data -- for example, on the potentially inappropriate use of the service from government offices, and assessment of the fraud claims that had been made by former users -- reporting that simply plucked individuals out of the Ashley Madison databases and treated them as ‘sources’ for additional comment may well have done more to traumatise people who had already been victimised by the breach in the first place. And while there is no doubt that accountability reporting can sometimes have negative consequences for those whom it covers, as journalists we must make every reasonable effort to ensure that those consequences are reserved for those legitimately suspected of actual wrongdoing, and do not simply fall on people whose choices may differ from our own.

Inverting implicit biases

A key component of the ethical challenge presented by cases like those above stems from our need, as journalists, to confront our own implicit biases. Whether or not we find the subjects of leaked data likable or even sympathetic, we must carefully weigh the news value of reporting with leaked data against the privacy interests of the people we are reporting on. As we’ve discussed, this can be particularly difficult to do when the data in question is simply posted online, since these datasets lack a human source to remind us of the potential ramifications of publishing. Importantly, this loss of context often also obscures the motives of the people who obtained and/or posted the hacked data in the first place -- something that should arguably be a core focus of the reporting on it.

By their very nature, of course, our own biases are difficult to counter. Consulting with colleagues and editors is always a good place to start sanity-checking our first judgements. Another strategy is to use a simple thought experiment: If the contents of the leak were something that we personally believe should be private -- perhaps individuals' HIV status, for example -- would we still report on it, and how? Especially when we consider that most of us are unwilling subjects of data collection in the first place, it's important that journalists consider how to minimise additional harm to any individuals whose personal information they are working with -- no matter how it came into their hands.

At times, however, reporting with sensitive and/or personal information is unavoidable, and may be essential to an important piece of accountability journalism. Where that is the case, there are still a number of ways that journalists can do the reporting and publishing that they need to while minimising potential harm as much as possible. Though today's networked data environment means there are few guarantees to be had, a clearly defined and thoroughly considered process -- especially one conducted in consultation with experienced colleagues -- can help journalists be confident in the appropriateness of their reporting and publication choices when dealing with hacked, leaked, and otherwise sensitive data.

Considerations for reporting

Good security helps ensure privacy protections

While we are unabashed advocates for strong information security practices, a key reason for this is the protections that they allow journalists to provide for their human sources as well as any potentially sensitive data resources that they may have. Although many companies affected by data leaks and breaches have far more resources to dedicate to security, journalists have demonstrated a unique ability to protect information. The first step in treating personal data ethically is to take reasonable precautions to ensure that it does not slip out of your control. A great resource for beginning to enhance your own security know-how is the Electronic Frontier Foundation's Surveillance Self-Defense website, which has everything from a security ‘starter pack’ to guides on particular tools.

There are many resources available online for journalists to start building up their security capability.

Take care when verifying

While many appeared to relish The Intercept's apparent missteps when verifying documents allegedly provided to them by NSA contractor Reality Leigh Winner, it can be hard to know what information may tip off interested parties as you verify elements of a story.

If you are relying on web searches, consider using a location-masking browser like Tor to make it more difficult for online service providers to infer what you are researching. If you are relying on human sources, consider what you know about the provenance of the data you are dealing with. If it is genuine, ask yourself: What is the likely position of someone who would have access to it, and what might lead another person to guess where it came from? Use your answers to guide what information you share with whom when verifying.

No matter the circumstance, however, it is always wiser to avoid sharing original documents. Instead, retype segments of content (correcting obvious spelling and punctuation errors) that you need to reveal -- many organisations will distribute sensitive documents with unique typographical or formatting features, so that they can pinpoint the source of an internal leak if originals show up online.

To protect their source, Axios retyped the content of leaked White House schedules.

Finally, it's always a good idea to review documents (especially PDFs, .doc/x files and spreadsheets) on an old computer that will never be connected to the internet (use a thumb drive to move them there). While this may seem cumbersome at first, it also helps protect your other information (and your organisation) from the malware and viruses that leaked data may contain. This is especially true if you plan to print hard copies of documents -- the ‘enable editing’ permission required for printing can also put your own computer's data at greater risk.

Considerations for publishing

How much is enough?

Providing reasonable privacy protections can sometimes seem at odds with imperatives around both accountability and transparency. When it comes to publishing data, however, the choice is not ‘all or nothing’. In fact, digital publishing gives journalists a range of ways to strike a balance between protecting private data and being transparent about their work.

Naturally, any data to be published must first be verified; this alone will limit the incidental exposure of personal information, since verification -- in addition to being a hallmark of responsible journalism -- is incredibly time-consuming.

Once verified, there is a question of relevance: Is the personal information you plan to publish essential to the story, or not? This can be a difficult question to answer. But just as we should reflect on why we might include information in a story about someone's age, race, immigration status, or other demographic attributes, we should relevance-test personal information we intend to publish. In short: Does the story really need it? This is especially important to consider given the ripple-effect of revealing personal information about someone. Family members, work colleagues, and -- in this age of social media -- even casual acquaintances may be affected by the revelation of your subject's personal details. Moreover, since most journalism today is inevitably published into a global context, the norms to consider when publishing personal details are not confined to a single region or culture. In general, you should be as conservative as possible without sacrificing the integrity of the story. Only after doing a thorough risk-benefit analysis, which keeps the costs to the individual at the center of your reasoning, should you make a determination about what to publish.

Balancing privacy and transparency

Just as journalism seeks to hold power to account, journalists should make themselves accountable to the public as well. Where possible, this means sharing data, sharing code, and providing detailed methodologies for the stories that you produce.

Yet, in many cases, wholesale publication of data or documents may violate the privacy of innocent people, or even put them at risk. In these cases, there are a number of methods that journalists can use to make key information public.

Redaction

Tools like DocumentCloud allow journalists to both publish documents and retain fine-grained control over them, offering both redaction and annotation tools. Journalists hoping to redact information from documents are advised to do so with tools designed for the purpose, as some methods (for example, drawing black boxes in Adobe, or ‘hiding’ Excel columns or rows) are easily undone. It's also important to keep in mind that simple methods of de-identification (such as removing names and addresses) are often insufficient to protect people's identities, given how much information is available to cross-reference online.

An example of a redacted CIA document. Source: Wikimedia.
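
To make the point about insufficient de-identification concrete, here is a minimal Python sketch of one check a newsroom might run before publishing a dataset with names and addresses already removed. It counts how many rows are still unique on a combination of remaining fields, since a unique row is a candidate for cross-referencing. The function name `risky_rows`, the field names, and the sample records are all hypothetical, and the threshold `k` is an assumption rather than an established standard.

```python
from collections import Counter

def risky_rows(rows, quasi_identifiers, k=2):
    """Return rows whose combination of quasi-identifier values
    is shared by fewer than k rows in the dataset."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] < k]

# Hypothetical records with names and addresses already stripped out.
records = [
    {"postcode": "10001", "birth_year": 1980, "gender": "F"},
    {"postcode": "10001", "birth_year": 1980, "gender": "F"},
    {"postcode": "10001", "birth_year": 1975, "gender": "M"},
    {"postcode": "10002", "birth_year": 1990, "gender": "F"},
]

unique = risky_rows(records, ["postcode", "birth_year", "gender"])
print(len(unique), "of", len(records), "rows are unique on these fields")
```

A check like this only flags the most obvious risk. Passing it does not mean the data is safe to publish, because it cannot know what other datasets a reader might cross-reference.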

Samples and summaries

Another way that journalists can help protect individuals' privacy when publishing private data is to publish only a curated sample of data, or to publish data that has been summarised to the point that re-identification will be difficult, if not impossible. As above, however, the vast quantity of information available online means that both samples and summaries must be carefully designed to avoid revealing more than intended. We suggest consulting with a statistician if possible to help ensure that the measures taken are sufficient. A good place to start is to remove any obviously identifying information combinations, and then use the fully reported story to help guide your thinking about what samples or summaries will further clarify that story without inadvertently exposing individuals' information.
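
As a rough illustration of summarising data before release, the Python sketch below counts records per category and folds any category with very few records into a single combined bucket, so that small groups are not singled out. The function name `summarise`, the `district` field, the bucket label, and the minimum count of 5 are assumptions chosen for the example, not a recommended standard. The right threshold depends on the data and is worth discussing with a statistician.

```python
from collections import Counter

def summarise(rows, field, min_count=5):
    """Count rows per category, merging categories with fewer than
    min_count rows into one combined bucket."""
    counts = Counter(row[field] for row in rows)
    summary = {}
    small_total = 0
    for category, n in counts.items():
        if n >= min_count:
            summary[category] = n
        else:
            small_total += n
    if small_total:
        summary["other (small groups)"] = small_total
    return summary

# Hypothetical records: only the field being summarised is shown.
rows = (
    [{"district": "North"}] * 6
    + [{"district": "South"}] * 5
    + [{"district": "East"}] * 2
    + [{"district": "West"}] * 1
)

print(summarise(rows, "district"))
```

Note that if only one category falls below the threshold, the combined bucket still reveals that category's exact count, so the output should be reviewed by hand before publication.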

Visualisation

Visualisation is itself often a way of aggregating data in such a way that meaningful patterns are revealed in otherwise heterogeneous datasets. An advantage of visualisation for privacy protection is that only a very small subset of data features is typically needed to create a visualisation, and, if well done, the visualisation adds real value to the story being told. Static graphics, in addition to being more platform-friendly, also naturally limit the amount of inference and post-publication manipulation that is possible.

When the Journal News published an interactive map of gun permit holders following a school shooting in 2012, many were outraged by the apparent privacy invasion.

In 2011, the Guardian was careful to publish maps about the UK Riots such that the underlying information was not accessible, out of concern for the type of backlash that the Journal News ultimately suffered. Both of these maps are now offline.

Maps published with the Fusion series Suspect City illustrate where and how different age groups were stopped by police in Miami Gardens, Florida. Because many of those stopped were vulnerable or underage, these visualisations show the patterns of stops without revealing other personal details.
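
One common way to show geographic patterns without exposing exact locations is to aggregate points into grid cells before mapping them. The Python sketch below illustrates that general idea under stated assumptions: it is not the method used by any of the outlets mentioned above, the function name `bin_points` and the cell size are hypothetical, and the coordinates are made up.

```python
import math
from collections import Counter

def bin_points(points, cell_size=0.01):
    """Assign each (lat, lon) pair to a grid cell and count points per cell.
    Cells are identified by integer (row, column) indices; multiply an index
    by cell_size to get the cell's lower-left corner when drawing the map."""
    cells = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        cells[cell] += 1
    return cells

# Made-up coordinates for illustration only.
points = [
    (25.9421, -80.2456),
    (25.9467, -80.2412),
    (25.9533, -80.2391),
]

for cell, count in bin_points(points).items():
    print(cell, count)
```

Only the per-cell counts would then be passed to the mapping tool, so the published graphic never contains the original coordinates. Cells containing very few points may still identify individuals and might need to be merged or dropped.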

Selective release

Sometimes a journalist may find themselves in possession of a unique trove of data that they simply do not have the resources to investigate as thoroughly as they would like. Happily, however, digital technologies remove the requirement to choose between a ‘data dump’ and keeping information entirely to yourself. For example, academic researchers are increasingly using simple contracts to support useful data sharing while also helping protect their data from misuse. Similarly, during the initial phase of the Panama Papers reporting, journalists working on the project were required to sign contracts about how they would handle both the data and their reporting processes -- an approach that succeeded in keeping the work safeguarded until the stories were ready to publish.

If there are reasons that data cannot simply be published, journalists can still indicate that they are open to data-sharing requests. While this barrier alone may be sufficient to deter many bad actors, requiring anyone wishing to use the data to sign a simple contract agreeing not to use or disclose it in certain ways offers an additional layer of protection, as they could be held legally liable if they fail to protect the data as agreed. Although this approach obviously involves some risk, it often provides a good balance between protecting sensitive or personal information and allowing responsible parties to hold journalists to account.

Conclusion

As the breadth and complexity of the broader data environment continue to grow, so, too, do the ethical challenges around reporting and publishing with such data. While the core considerations of news value and the public interest help to answer questions about what journalists should cover and how, the nature and scale of digital leaks and digital publishing have introduced new ethical issues that often need to be examined. However, just as with many other processes in digital journalism -- like verification -- creating a thoughtful, well-defined process for evaluating leaked data and deciding how it will be handled goes a long way to ensuring that your reporting efforts are not only efficient, but ethically sound.

Additional resources

For more on privacy and data journalism:

De-identification for data journalists

Data journalism and the ethics of publishing Twitter data

Ethical questions in data journalism and the power of online discussion
