
Uncovering the truth: Exploring the benefits of federated databases for policing records

Cross-domain collaboration is the way forward for covering police at scale

In the past four years, officers in Bakersfield, California, have broken 31 bones. In every one of those cases, the officers involved received no discipline. This startling finding was only uncovered because of an unprecedented cross-domain collaboration to make police records transparent. In Washington, D.C., The Washington Post partnered with the Investigative Reporting Program at the University of California-Berkeley, publishing The Unseen Toll of Nonfatal Police Shootings and building on its already impressive Fatal Force database.

And across the United States, some 20 different news organizations and journalists are working with Big Local News out of Stanford University to collect police decertification records, building up a repository that will help anyone trying to track problem officers.

Such collaborative data journalism efforts are international in scope as well. In Brazil, for example, a small team of reporters from the news organization Ponte partnered with Marcelo Soares, a leading data journalist, to expand their abilities and the scope of their work into policing issues.

“Ponte has a capable, diverse and small team of human rights-minded reporters, who hit hard and take no guff,” Soares wrote in an email. “What they didn't have was too much experience handling data analysis.”

“A story they could publish after literally our second class was this one, ‘Deaths Without Color.’ They used São Paulo data on police killings and showed how the police gradually stopped recording the ethnicity of many people they killed,” Soares added. “It grows month by month after the deaths of George Floyd in the U.S. and Beto Freitas in the south of Brazil.”

Now, this type of work is expanding in a new way, moving from a web of journalists working together to include others outside journalism: data scientists, advocates, and criminal defense lawyers. In California, the Community Law Enforcement Accountability Network is building a model for how public defenders, community advocates, data scientists and investigative journalists can work together to uncover and expose police records.

Even with cross-domain collaboration, the challenges are huge. But without such partnerships, the task would be impossible, said Barry Scheck, the founder of The Innocence Project, a nonprofit organization that works to free those wrongfully convicted.

“Working across domains is, of course, essential because that’s the only way we are going to accurately and reliably identify and gather and share the data. So, you are best able to do that when you are working across domains and with different groups that are all engaged in the same enterprise,” Scheck said. “The second part that I’ve really come to appreciate is the critical function that civil society plays when trying to perform oversight of the policing function. It [policing] really is embedded in secrecy and has been, and I don’t see any other way to effectively get this out into the open.”

So far, the effort has obtained 165,000 records, all through public records requests, from around 700 agencies across the state of California, said Lisa Pickoff-White, a visiting senior data journalist at Big Local News and a data journalist at KQED. Entering just one complete case can take anywhere from an hour to four hours.

But working across disciplines may soon pay off. Data scientists at Berkeley are helping to build a human-in-the-loop machine learning system, a set of tools that will make extracting key facts easier. Meanwhile, journalism interns at Stanford and the Investigative Reporting Program at Berkeley hand-enter and verify data from the cases and then help report out stories. That’s how the Bakersfield story on police breaking bones was produced. The goal: keep scaling up and publishing critical accountability journalism along the way, all done with collaborative partners, from lawyers to data scientists.
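To make the human-in-the-loop idea concrete, here is a minimal sketch in Python. It is not the Berkeley team's system: the patterns, the field names (`incident_date`, `force_type`) and the sample sentence are all hypothetical, and the machine learning step is stood in for by simple pattern matching. The point it illustrates is the division of labor, in which software proposes values and anything ambiguous is flagged for a person to check.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only; real case files vary widely.
DATE_PATTERN = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b")
FORCE_TERMS = ("baton", "taser", "firearm", "canine")


@dataclass
class Extraction:
    field: str
    value: str
    needs_review: bool  # True means a person should confirm or fill in the value


def extract_fields(text: str) -> list[Extraction]:
    """Propose candidate facts from a document and flag ambiguous ones."""
    results = []

    dates = DATE_PATTERN.findall(text)
    if dates:
        # More than one date is ambiguous, so a reviewer picks the incident date.
        results.append(Extraction("incident_date", dates[0], len(dates) > 1))

    lowered = text.lower()
    found = [term for term in FORCE_TERMS if term in lowered]
    for term in found:
        results.append(Extraction("force_type", term, False))
    if not found:
        # Nothing matched, so the field is left for manual entry.
        results.append(Extraction("force_type", "", True))

    return results


# Made-up sentence, not taken from any real record.
sample = "On 3/14/2019 the officer struck the subject with a baton. Report filed 3/20/2019."
for item in extract_fields(sample):
    print(item)
```

In a workflow like this, student reporters would spend their time on the flagged fields rather than typing every value by hand.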


The roots of a policing collaboration

In 2014, during my first year teaching data journalism at Stanford University, I gave my students a public records assignment: request police stop data from state patrol agencies across the United States. A few months and millions of traffic stops later, I started a conversation with Sharad Goel, an engineering professor with expertise in criminal justice and racial bias (now at Harvard University). Within the year, we launched the interdisciplinary Stanford Open Policing Project. In the years since, we have advised police agencies, trained hundreds of journalists on how to analyze the policing data for their own stories and seen law and policy changes from Nashville to California.

The Open Policing Project spurred the idea of another collaborative effort, Big Local News, a data-sharing platform that makes data-driven accountability journalism easier to achieve for local newsrooms.

Around the same time, a new California law spurred another collaboration. “Lawmakers passed the landmark ‘Right to Know Act’ in 2018, chipping away at a four-decade wall of secrecy concerning police internal investigations and officer discipline in California,” according to what is now known as the California Reporting Project. “Six founding organizations joined together to seek the transparency that SB 1421 promised.”

Those initial six news organizations grew to 40, and they joined forces to sue when necessary and to work together on stories they couldn’t have done alone. In 2019, the collaboration morphed again, spurred by Scheck's interest. Some 70 people, data scientists, journalists, lawyers, advocates and more, came together at Stanford that fall to discuss how best to continue the effort toward police transparency.


Already, the journalists were facing daunting challenges, from negotiating for records to technical issues with working with the records once obtained.

The solution: work across domains to build an infrastructure that will support a variety of needs and that will foster that same transparency in police accountability. The new effort: The Community Law Enforcement Accountability Network (CLEAN). The partners include Big Local News; the California Reporting Project and its newsroom partners; criminal justice advocacy groups, such as the ACLU, the National Association of Criminal Defense Lawyers and the Innocence Project; and data scientists from the University of California at Berkeley.

“Working across domains is of course essential,” Scheck said. “Because that’s the only way we are going to accurately and reliably identify and gather and share the data.”

The negotiation

Now, every week, Big Local News journalist Phoebe Barghouty sends emails and has phone call after phone call with local government clerks and information request managers about the status of requests for the California Reporting Project. Sometimes she leads student training on how to file requests. On other days, she confers with lawyers on which cases may end up as part of a lawsuit challenging denials by a police agency. Just this October, KQED, one of the California Reporting Project partners, sued to obtain critical records from the California Department of Corrections.

Barghouty works with journalism students at Stanford and Berkeley, along with other partners, to keep the public records moving along. (Those same students work on stories too.)

Finally, Barghouty helps import and organize gigabytes of documents, audio and video files. Every week, she meets with Pickoff-White to chart the next steps.

“We are possibly one of the largest projects that’s doing requests at scale,” said Pickoff-White. Just the act of requesting police use of force and misconduct cases is daunting. The journalists work with the nonprofit public records organization MuckRock, yet another partner in our sprawling collaborative effort, to stay organized on records requests. Once documents come in, the journalists and students then upload all those records to DocumentCloud. From there, the partners begin the process of hand-entering dozens of fields of information into a database. And the information in that database is analyzed for possible news stories, one jurisdiction at a time.

“We’ve been able to request so much but we’re still figuring out how to process it all,” said Pickoff-White.

From documents into data

Figuring out “how to process it all” is where the Berkeley Institute for Data Science comes into the picture. Led by Nobel Laureate Saul Perlmutter, BIDS is partnering with journalists and lawyers alike to provide the tools to move beyond manual data entry. Professors Joe Hellerstein and Sarah Chasins of the Department of Electrical Engineering and Computer Sciences and Aditya Parameswaran of the School of Information and EECS are developing new methodologies in AI, databases, human-computer interaction, and visualization to enable the work. They now routinely meet with data and investigative journalists to sort through the daunting technical challenges. Some of their PhD students worked with investigative reporting students on clustering algorithms, for example, to better identify themes in the policing documents.
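For readers unfamiliar with clustering, the following Python sketch shows the general idea of grouping documents by shared vocabulary so that themes surface. It is an illustration under stated assumptions, not the algorithm the Berkeley students used: it relies on a simple greedy pass with Jaccard word overlap, and the stopword list, the `threshold` value and the four sample summaries are all made up for the example.

```python
import re

# Small illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "with",
             "was", "on", "for", "by", "after", "during"}


def tokens(text):
    """Lowercase a document and return its set of non-stopword words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Share of words two sets have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.2):
    """Greedy clustering: join the most similar existing cluster or start a new one."""
    clusters = []  # each entry: {"vocab": set of words, "members": [doc indices]}
    for i, doc in enumerate(docs):
        words = tokens(doc)
        best = max(clusters, key=lambda c: jaccard(words, c["vocab"]), default=None)
        if best is not None and jaccard(words, best["vocab"]) >= threshold:
            best["members"].append(i)
            best["vocab"] |= words
        else:
            clusters.append({"vocab": set(words), "members": [i]})
    return [c["members"] for c in clusters]


# Made-up one-line summaries, not real case records.
docs = [
    "Officer struck subject with baton during arrest",
    "Baton strike during arrest caused fracture",
    "Complaint alleges dishonesty in written report",
    "Officer dishonesty found in written report",
]
print(cluster(docs))
```

On these four summaries, the two baton documents land in one group and the two dishonesty documents in another. Production tools typically use weighted word vectors or embeddings instead of raw word overlap, but the reporting payoff is the same: similar cases end up side by side.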

BIDS’ involvement in the effort started with a conversation between Scheck and Perlmutter, not long after George Floyd was killed by police officers in Minneapolis in May 2020.

Collaboration with scientists has already become part of the model for data journalism.

In August, The Places Project reported on its efforts to build a “Collaborative Platform for Societal Issues.” “In order to experiment with a renewed collaboration between researchers and journalists, the PLACES project wanted to make them co-actors in the process, inviting them to take part in a citizen science approach,” an executive summary of the report said.

And an article on using satellite imagery in journalism, published in the Online Journalism Blog, noted that such work almost always involves a collaboration.

“Most journalistic pieces that use AI and satellite imagery are collaborative projects and rely on a data expert,” that author, Federico Acosta Rainis, said in that article.

The CLEAN leaders hope this work will be an example of what can be done elsewhere in criminal justice reporting. The effort followed up its initial convening in 2019 with another daylong meeting in December 2021 (with a hiatus in between due to the pandemic).

“It will be a prototype in terms of a model of cooperation,” Scheck said at the beginning of that 2021 meeting. “That stakeholders in the criminal legal space (we have with us today progressive prosecutors, inspector generals, a lot of reporters, public defenders, civil rights lawyers, you name it) can all get together and in an efficient, appropriate way, share information about law enforcement misconduct that we have never known about before.”

On the data end of the project, the data scientists are working to build what they call a federated database, one that will make it easy to both protect private data and share public data among the partners. The second component is to build tools to make it easier to access and analyze the data, said Perlmutter in the 2021 meeting.
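One way to picture a federated database is as a set of partner-owned stores that answer shared queries without handing over everything they hold. The Python sketch below is a toy model of that idea, not CLEAN's design: the class `PartnerStore`, the function `federated_query`, the partner names and every record in it are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PartnerStore:
    """A partner's own database. Private fields never leave this object."""
    name: str
    records: list = field(default_factory=list)

    def add(self, officer_id, public, private):
        self.records.append(
            {"officer_id": officer_id, "public": public, "private": private}
        )

    def query_public(self, officer_id):
        # Return copies of the public fields only, tagged with their source.
        return [
            dict(r["public"], source=self.name)
            for r in self.records
            if r["officer_id"] == officer_id
        ]


def federated_query(stores, officer_id):
    """Ask every partner the same question and merge the public answers."""
    results = []
    for store in stores:
        results.extend(store.query_public(officer_id))
    return results


# Hypothetical partners and made-up records for illustration.
newsroom = PartnerStore("newsroom")
defenders = PartnerStore("public_defenders")
newsroom.add("A-100", {"incident": "use of force"}, {"source_contact": "withheld"})
defenders.add("A-100", {"incident": "sustained dishonesty"}, {"client_name": "withheld"})

for row in federated_query([newsroom, defenders], "A-100"):
    print(row)
```

Each partner keeps control of its sensitive material, such as a confidential source or a client's name, while the public portion of every record stays searchable across the whole network.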

“The conversation is now very up-front and center about how best to help society and do fundamental research with data science,” he said before ticking off the ability to use newer machine-learning methods, other AI systems, cloud computing and more. And seeing disparate organizations work together is exciting, he added. “It makes you feel that we have the capability to solve the deep problems that we’re seeing.”

The 20 news organizations pushing to collect police certification data expect to use some of the same technologies being developed by CLEAN. And the certification project is already using a data processing pipeline developed by Big Local News. That effort, like many of these new collaborations, is also building on previous efforts. John Kelly, now a data editor at ABC-owned stations in the United States, first approached this growing, informal network of news organizations eager to work together because he had already developed an earlier version of a certification database for a project at USA TODAY. But moving the scope beyond one organization makes it easier to scale a project in a way that benefits more newsrooms. And that benefits the public more too.

“Local newsrooms don’t always have the time or the technological skill or resources to go through these records on their own,” said Pickoff-White. She added that the next step will be to make at least some of the data more widely available. “We’re hoping to allow members to more easily sort and sift and analyze records about policing in California and eventually disseminate this information to the public.”

That Bakersfield story about police breaking bones with baton strikes was made possible because of the involvement of more than a dozen students from Stanford and Berkeley who “painstakingly” entered all the data, noted David Barstow, the head of investigative reporting at the UC Berkeley Graduate School of Journalism, at the 2021 convening. “That was something that only emerged from this systematic examination of every single use of force in Bakersfield.”

As the tools built by BIDS continue to be used, the available data will grow, and so will the possibilities.

The increased transparency that comes from such efforts may mean that policing itself will change, Scheck said. “They (police) should be subject to the same transparency as any other public servant,” he said. “And that’s what’s at the heart of this struggle right now.”

Where do collaborations end?

Back in Bakersfield, a few weeks after the stories about police breaking bones, student reporters were back at it. For their second story, as part of the CLEAN project, they focused on mental health issues. Their work was published by local radio stations, KQED in San Francisco and The Associated Press. And after that, the reporters didn’t just leave. Instead, they took the collaborative ethos to a new level.

Creating a system that will include disseminating policing data to the public means understanding what the public needs and wants. And that means, you guessed it, collaboration. Students and journalists met with community groups, surveying them for what they wanted out of the new police transparency effort. Now, that information is being used to help develop the searchable interface available to the public.

Lessons learned

  • Build in time to turn messy police records into structured data. You can use the power of a curated crowd, those collaboration partners, to make the lift of turning messy documents into valuable data achievable. At the same time, you can produce related stories as you go. You will find nuggets of newsworthiness as you collect and read through the documents you obtain. Report on those. You can use them at the end of a big project or publish as you go. Either way, you are building forward momentum.

  • Collaborations can cross domains. Work across spheres with lawyers, data scientists, engineers and more to achieve your goals. Some projects are just too massive to achieve otherwise. But set up working agreements from the start. At a minimum, it ensures you have had important conversations from the outset.

  • Use other data to add vital context. Some examples include
