Over the last decade, one of the goals of data journalism has been to increase accountability and transparency through the release of raw data. Admonitions of “show your work” have become common enough that academics judge our work by the datasets we link to. These goals were admirable and (in the context of legitimizing data teams within legacy organizations) even necessary at the time. But in an age of 8chan, Gamergate, and the rise of violent white nationalism, it may be time to add nuance to our approach.
For this discussion, I'm primarily concerned with the publication of personal data (also known as personally-identifiable information, or PII). In other words, we’re talking about names, addresses or contact info, lat/long coordinates and other geodata, ID numbers (including license plates or other government IDs), and other data points that can be traced back to a single individual.
Much of this is available already under the public record, but that’s no excuse: as the NYT Editorial Board wrote in 2018, “just because information is public doesn’t mean it has to be so easy for so many people to get.” It is irresponsible to amplify information without thinking about what we’re amplifying and why.
Moreover, the idea that journalists could contribute to personal data leaks isn't theoretical: many newsroom projects start with large-scale FOIA dumps or public databases, which may include exactly this personal data. There have been movements in recent years to monetize these databases--creating a queryable database of government salaries, for example, and offering it via a subscription or using it as a source of reliable traffic from rubberneckers. Even random public records requests may disclose personal data. Intentionally or not, we’re swimming in this stuff and have become jaded as to its prevalence. Is it right for us to simply push it out without re-examining the implications of doing so?
I would stress that I’m not the only person who has thought about these things, and there are a few signs that we as an industry are beginning to formalize our thought process in the same way that we have standards around traditional reporting:
The Markup’s ethics policy contains guidelines on personal data, including a requirement to set an expiration date (after which point it is deleted).
Reveal’s ethics guide doesn’t contain specific data guidelines but does call out the need to protect individual privacy: “Recognize that private people have a greater right to control information about themselves than do public officials and others who seek power, influence or attention. Only an overriding public need can justify intrusion into anyone’s privacy.”
The AP no longer reports the names or runs stories on mugshots for minor crimes.
The New York Times ran a session at NICAR 2019 on “doxxing yourself,” in part to raise awareness of how vulnerable reporters (and by extension, readers) may be to targeted harassment and tracking.
A 2016 SRCCON session on “You’re The Reason My Name Is On Google: The Ethics Of Publishing Public Data” explored real-world lessons from the Texas Tribune’s salary databases (transcript here).
Poynter wrote about the conflicts and difficulties that journalists have when publishing personal data all the way back in 2013.
A data-rich environment is dangerous
In her landmark 2015 book The Internet of Garbage, Sarah Jeong sets aside an entire chapter just for harassment. And with good reason: the Internet has enabled new innovations for old prejudices, including SWATting, doxing, and targeted threats at a new kind of scale. Writing about Gamergate, she notes that the action of its instigator, Eron Gjoni, “was both complicated and simple, old and new. He had managed to crowdsource domestic abuse.”
More recently, until it was driven off of its CDN provider, the Kiwi Farms forum served as a home base for digital bullying, as posters there would pick vulnerable targets (especially those who were LGBTQ), indiscriminately collect information about them by scouring different web sources, and then attempt to hound them into suicide or retreat from public life. KF was not known for being particularly good at gathering information, but it didn't need to be: accuracy is not the point of a harassment campaign, and collateral damage was something it was happy to encourage.
I'm focusing on harassment here because I think it provides an easy touchstone for the potential dangers of publishing personal information. Since Latanya Sweeney’s initial work on de-anonymizing data, an entire industry has grown up around taking disparate pieces of information, both public and private, and matching them against each other to create alarmingly-detailed profiles of individual people. This is the foundation of the business model for Facebook, as well as a broad swathe of other technology companies. This information includes your location over time. And it’s available for purchase, relatively cheaply, by anyone who wants to target you or your family. Should we contribute, even in a minor way, to that ecosystem?
These may seem like distant or abstract risks, but that may be because, for many of us, this harassment is more distant or abstract than it is for others. A survey of “news nerds” in 2017 found that more than half are male, and three-quarters are white (a demographic that includes myself). As a result of this background, many newsrooms have a serious blind spot when it comes to understanding how their work may be seen by (or used against) underrepresented populations.
In particular, as rhetoric has ramped up over the last decade, it's become clear that newsrooms are not listening to the few trans journalists in their ranks. When the US "paper of record" fights back against updating historical bylines that contain their own reporters' deadnames, it sends a clear message about whose data matters, whose doesn't, and how seriously the institution takes the threat of personal metadata.
We are very bad as an industry at thinking about how our power to amplify and focus attention is used. Even if harassment is not the ultimate result, publishing personal data may be seen by our audience as creepy or intrusive. At a time when we are concerned with trust in media, and when that trust is under attack from the top levels of government, more care is necessary.
Names and shame
Ultimately, I think it is useful to consider our twin relationship to power and shame. Although we don’t often think of it this way, the latter is often a powerful tool in our investigative reporting. After all, as the fourth estate, we do not have the ability to prosecute crimes or create legislation. What we can do is highlight the contrast between the world as we want it to be and as it actually is, and that gulf is expressed through shame.
The difference between tabloid reporting and “legitimate” journalism is the direction in which shame is aimed. The latter targets its shame toward the powerful, while the former is as likely to shame the powerless. In terms of accountability, it orients our power against the system, not toward individual people. It’s the difference between reporting on welfare recipients buying marijuana, as opposed to looking at how marijuana licensing perpetuates historical inequalities from the drug war.
Our audiences may not consciously understand the role that shame plays in our journalism, but they know it’s a part of the work. They know we don’t do investigations in order to hand out compliments and community service awards. When we choose to put the names of individuals next to our reporting, we may be doing it for a variety of good reasons (perhaps we worked hard for that data, or sued to get it) but we should be aware that it is often seen as an implication of guilt on the part of the people within.
In the small Virginia county where I went to high school, the local right-wing newspaper would publish the salaries of every teacher in the local public school system. There was no explicit threat of violence, but it was meant to feel invasive and hostile, and it did. When I worked at the Seattle Times and had conversations with editors about potentially creating a salary database for Washington State, it was hard to capture the difference between what we were doing, and what that Virginia paper had attempted to do. For the people named in those kinds of databases, it probably doesn't feel like there's really a difference at all.
Toward a philosophy of PII in reporting
I want to be very clear that I am only talking about the public release of data when I ask for increased caution. I am not arguing that we should not submit FOIA or public records requests for personal data or that it can’t be useful for reporting. I’m also not arguing that we should never distribute this data: it can still be shared in aggregated form, on request, or through inter-organizational channels. It is important for us to show our work and to provide transparency. I’m simply arguing that we don’t always need to release raw data containing personal information directly to the public.
In the spirit of Maciej Ceglowski’s Haunted by Data, I’d like to propose we think of personal data in three escalating levels of caution:
- Don’t collect it!
When creating our own datasets, it may be best to avoid personal data in the first place. Remember, you don’t have to think about the implications of the GDPR or data leaks if you never have that information. When designing forms for story call-outs, try to find ways to automatically aggregate or avoid collecting information you will not use during reporting. I will note that this is often a tougher decision than it seems – consider, for example, the source diversity tracking that many newsrooms are now attempting to incorporate to diversify their coverage, which by extension often means gathering (and retaining) some degree of identifying data.
- Don’t dump it!
If you have the raw data, don’t just throw it out into the public eye because you can. In general, we don’t work with raw data for reporting anyway: we work with aggregates or subsets because that’s where the best stories live. What’s the difference in policy effects between population groups? What department has the widest salary range in a city government? Where did a disaster cause the most damage? Releasing data in an aggregate form still allows end-users to check your work or perform follow-ups. And you can make the full dataset available if people reach out to you specifically over e-mail or secure channels (but you’ll be surprised how few actually do). Note that even aggregated or anonymized datasets may be vulnerable to so-called Database Reconstruction Attacks.
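As a rough illustration of releasing aggregates instead of rows, here is a minimal Python sketch that answers the "widest salary range" question and keeps only per-department summaries. The rows, names, and field names are hypothetical placeholders, not drawn from any real dataset.

```python
from collections import defaultdict

# Hypothetical raw rows; in practice these would come from a records request.
rows = [
    {"name": "Person A", "department": "Parks", "salary": 52000},
    {"name": "Person B", "department": "Parks", "salary": 61000},
    {"name": "Person C", "department": "Transit", "salary": 48000},
    {"name": "Person D", "department": "Transit", "salary": 95000},
    {"name": "Person E", "department": "Transit", "salary": 70000},
]


def summarize_by_department(rows):
    """Collapse individual rows into one summary row per department."""
    salaries_by_dept = defaultdict(list)
    for row in rows:
        salaries_by_dept[row["department"]].append(row["salary"])

    summary = []
    for dept, salaries in sorted(salaries_by_dept.items()):
        summary.append({
            "department": dept,
            "employees": len(salaries),
            "min_salary": min(salaries),
            "max_salary": max(salaries),
            "salary_range": max(salaries) - min(salaries),
        })
    return summary


# This summary, not `rows`, is what would be published.
summary = summarize_by_department(rows)
widest = max(summary, key=lambda dept: dept["salary_range"])
```

Readers can still check the headline finding against the summary table, but no names leave the newsroom. A department with only one or two employees still exposes individual salaries through its minimum and maximum, so very small groups may need to be merged or withheld before release.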
- Don’t leave it raw!
In cases where distributing individual rows of data is something you’re committed to doing, consider ways to protect the people inside the data by anonymizing it without removing its potential usefulness. One approach that I love from ProPublica Illinois’ parking ticket data is the use of one-way hash functions to create consistent (but anonymous) identifiers from license plates: the input always creates the same output, so you can still aggregate by a particular car, but you can’t turn that random-looking string of numbers and letters back into an actual license plate. As opposed to “cooking” the data, we can think of this as “seasoning” it, much as we would “salt” a hash function. A similar approach was used in the infosec community in 2016 to identify and confirm sexual abusers in public without actually posting their names (which would have opened the victims up to retaliation).
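To make the hashing idea concrete, here is a minimal Python sketch. It is not ProPublica's actual code: the key handling, field names, and sample rows are assumptions for illustration. It uses a keyed hash (HMAC-SHA256) rather than a bare hash, because license plates are short enough that someone could hash every possible plate and match the results unless a secret is mixed in.

```python
import hashlib
import hmac
import secrets
from collections import Counter

# Hypothetical secret key. Generate it once, keep it out of the published
# files and out of version control, and destroy it if nobody should ever be
# able to re-link identifiers to plates.
SECRET_KEY = secrets.token_bytes(32)


def normalize_plate(plate: str) -> str:
    """Uppercase and drop punctuation so 'abc-123' and 'ABC 123' match."""
    return "".join(ch for ch in plate.upper() if ch.isalnum())


def pseudonymize(plate: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable identifier that cannot be reversed without the key."""
    message = normalize_plate(plate).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


# Hypothetical sample rows, not real ticket data.
tickets = [
    {"plate": "ABC 123", "fine": 50},
    {"plate": "abc-123", "fine": 75},
    {"plate": "XYZ 789", "fine": 50},
]

# The published rows carry the identifier and drop the plate itself.
published = [
    {"vehicle_id": pseudonymize(row["plate"]), "fine": row["fine"]}
    for row in tickets
]

# Aggregating by vehicle still works on the published rows.
tickets_per_vehicle = Counter(row["vehicle_id"] for row in published)
```

The same plate always maps to the same `vehicle_id`, so a reader can count tickets per car, but the identifier is only as safe as the key. Other columns such as a street address or a rare vehicle model may still identify someone on their own, so hashing one field is a starting point and not a guarantee.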
Policies, guidelines, and organizational support
The three "don'ts" above are rules that you can adopt for yourself or for a team that you lead inside a newsroom. They don't require institutional buy-in – they're like your code style or your team's best practice documents. But while ethics exist for ourselves, they are also historically a way that newsrooms establish trust and relationships with a community (see also: the original development of "objectivity" as a method of verifying factual truths, not as an abstention from political life). And that means having a public policy.
At NPR, sometime around 2019, I started the process of developing a set of public guidelines for the News Apps team. Unfortunately, I wasn't able to complete it before leaving the organization in 2021, due mostly to conflicts in scheduling and newsroom staffing availability. Even so, I can offer some guidance on what surfaced during this process so you may get further than I did.
First of all, this is a conversation with many stakeholders. You'll want to talk with your legal department or counsel about any issues that they can foresee (including which terminology may prove binding, such as "policy" vs "guidelines"). You'll also want to bring in your standards department or copy chief, whoever is in charge of normally making coverage decisions. Your goal as a data journalist isn't to be the final decision maker but to be able to help inform the decisions that data-shy editors may need to make.
Second, try to think about your process holistically. Your newsroom may already have a policy for redacting names from coverage or archives when there's a credible threat of violence or when circumstances have changed (say, a non-notable person is accused of a crime and later found innocent, but the original coverage still surfaces in searches for their name). I assume you're also already following (or are at least aware of) the Trans Journalists Association style guide in terms of policies on redacting or altering deadnames. Having a personal data policy is a great way to unify your organization's approach when covering trans communities, people in the criminal justice system, and other communities that are normally shy about being in the journalistic spotlight.
Third, try to think about how non-aggregate data is dangerous in combination and who has access to it. The ability to link names with addresses, geolocation or birth dates gives potential harassers more leeway to combine it with information obtained elsewhere. If it's possible to routinely delete or archive data in an inaccessible place at a preset time after publication (say, 90 days), you can lower the possibility of misuse by staff or inadvertent leaks. Internal systems, such as source diversity audits, should be designed so that reporters and staff cannot access the original data if it is retained.
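A preset deletion window is easy to track mechanically. The sketch below is one possible shape for it, assuming a simple inventory of datasets with publication dates. The file names, dates, and the 90-day figure are illustrative, and the actual deletion or archiving step is left to whatever storage your newsroom uses.

```python
from datetime import date, timedelta

# Example window from the paragraph above; set this to match your own policy.
RETENTION_DAYS = 90


def purge_date(published_on: date, retention_days: int = RETENTION_DAYS) -> date:
    """Date on which a dataset should be deleted or moved to cold storage."""
    return published_on + timedelta(days=retention_days)


def is_due_for_purge(published_on: date, today: date,
                     retention_days: int = RETENTION_DAYS) -> bool:
    """True once the retention window has fully elapsed."""
    return today >= purge_date(published_on, retention_days)


# Hypothetical inventory of files that contain personal data.
datasets = [
    {"name": "callout_responses.csv", "published_on": date(2024, 1, 10)},
    {"name": "salary_request.csv", "published_on": date(2024, 3, 1)},
]

# A fixed date keeps the example reproducible; a real job would use date.today().
today = date(2024, 4, 15)
due = [d["name"] for d in datasets if is_due_for_purge(d["published_on"], today)]
```

Running a check like this on a schedule turns the expiration date into a routine task rather than something a busy team has to remember.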
Finally, always consider how a public guideline or policy could be used by people acting in bad faith. For example, take into account those who will use the policy to make reporting on their actions more difficult or to manipulate the tone of your journalism, as well as public figures who try to get coverage stricken. Spend time imagining how someone might try to abuse your rules, and then have conversations about how to respond ahead of time so that you're not trying to figure those situations out under pressure when – not if – they happen.
Once upon a time, this industry thought of computer-assisted reporting as a new kind of neutral standard: “precision” or “scientific” journalism. Yet as Catherine D’Ignazio and Lauren Klein point out in Data Feminism, CAR is not neutral, and neither is the way that the underlying data is collected, visualized, and distributed. Instead, like all journalism, it is affected by concerns of race, gender, sexual identity, class, and justice.
It is also, for better or worse, often an extractive process. Databases serve as the ultimate way to parachute into a community and make pronouncements about it, specifically because they do often feel so all-encompassing. On Chalkbeat's data team, we have tried to be conscious of the temptation to treat spreadsheets and public information as the story itself instead of relying on reporters who know and understand the locals' concerns, and can perform journalism with the community, not just on it. More importantly, we know that the way we publish today affects our ability to report within communities in the future, especially if they believe we're contributing to harassment, shaming, or abusive policy.
Incorporating an opinionated personal data strategy into our work gives data journalism a way to think about community-building and engagement. On a personal level, we can practice restraint in what we collect, dump, and publish in a raw form. As organizations, it's possible to create strong public commitments and policies on how we will handle identifying information for individuals.
Book: The Internet of Garbage, by Sarah Jeong. As a comprehensive cross-section of how harassment, spam, and copyright collide on the Internet, it's hard to top Jeong's book, which details not only the complicated problems of each, but also how they bleed into each other and interact in complex ways.
Article: Taking care with source security when reporting on abortion, by Olivia Martin, Martin Shelton, and Jessica Bruder. It's worth remembering that not only do we as journalists need to think about the bulk data we release, but also understand our reporting in the context of bulk data that could be used to identify, and even prosecute, our sources.
Article: How Data Journalists Can Use Anonymization to Protect Privacy, by Vojtech Sedlak. A good overview of techniques that you can use to season or de-identify a database, including helpful notes on how techniques can be broken or circumvented to re-identify subjects.