Oodles. Troves. Tsunamis. With data increasingly stored in extraordinary volume, investigative journalists are piloting extraordinary analysis techniques to make sense of these enormous datasets--and, in doing so, holding corporations and governments accountable.
They’ve been doing this with machine learning, which is a subset of artificial intelligence that deepens data-driven reporting. It’s a technique that’s not just useful in an age of big data--but a must.
The unwritten rule about when to use machine learning in reporting is pretty simple. When the humans involved cannot reasonably analyse data themselves--we’re talking hundreds of thousands of lines on a spreadsheet--it’s time to bring in the machines.
What is machine learning?
For journalists just getting started, it might be comforting to know that machine learning shares many similarities with statistics. It’s also worth noting that the semantics are a point of contention.
“Reasonable people will disagree on what to call what we’re doing now,” said Clayton Aldern, senior data reporter at Grist who recently co-reported the award-winning series Waves of Abandonment which used machine learning to identify misclassified oil wells in Texas and New Mexico.
Indeed, a running joke among data journalists is that “AI sells.”
The sentiment isn’t unfounded. Meredith Broussard, professor, journalist and author of Artificial Unintelligence: How Computers Misunderstand the World, said in an interview with the Los Angeles Times that “AI” took hold as a catchy name for what was otherwise known as structured machine learning or statistical modelling, in order to expand commercial interest. But there are differences.
“For one, we’re not using pen and paper,” said Aldern, who has master’s degrees in neuroscience and public policy from the University of Oxford. “We have the computational power to put statistical theories to work.”
That distinction is crucial, argues Meredith Whittaker, the Minderoo Research Professor at New York University and co-founder and director of the AI Now Institute.
Supervised machine learning has become “shockingly effective” at predictive pattern recognition when trained using significant computational power and massive amounts of quality, human-labelled data. “But it is not the algorithm that was a breakthrough: it was what the algorithm could do when matched with large-scale data and computational resources,” Whittaker said.
Scaling hardly means that humans aren’t involved. On the contrary, the effectiveness of machine learning in general, and for journalism, depends not only on access to quality, labelled data and computational resources, but the skills and infrastructural capacities of the people bringing these pieces together. In other words, newsrooms leveraging machine learning for reporting have journalists in the loop every step of the way.
“[Machine learning] has a big human component […] it isn’t magic, it takes considerable time and resources,” said Emilia Díaz-Struck, research editor at International Consortium of Investigative Journalists (ICIJ), which has used machine learning in investigations for more than five years. “Reporters, editors, software engineers, academics working together--that’s where the magic happens."
When is machine learning the right tool for the story?
Designing and running a machine learning programme is a big task, but there are numerous free or reasonably priced training programmes available for journalists and newsrooms to sharpen their skill sets; we describe the process and training options at the end of this article. But how does machine learning fit into the reporting process? Here are a few of the ways.
Managing overload: Clustering to find leads
When the International Consortium of Investigative Journalists, a nonprofit newsroom and network of journalists based in Washington, D.C., obtained the files that would make up the Pandora Papers, the sheer amount of information was initially staggering--just as it had been for the exposés they’d reported before, including the Panama Papers and Paradise Papers.
“Reporters were overwhelmed,” said Díaz-Struck. Before they could tell stories, they needed to know what was there, and what they didn’t need. To accomplish this, the ICIJ reporters used machine learning to sort and cluster, among other methods. “First, it worked like a spam filter,” said Díaz-Struck, referencing a popular machine learning application, which sometimes uses Bayes’ theorem to determine the probability that an email is either spam or not spam. The task sounds simple but wasn’t easy.
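The arithmetic behind such a filter is straightforward. Below is a toy sketch of the Bayes' theorem calculation a spam filter makes for a single word; the word and its frequencies are invented for illustration, and real filters combine evidence from many words.

```python
def spam_probability(p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """P(spam | word appears) via Bayes' theorem."""
    p_ham = 1 - p_spam
    numerator = p_word_given_spam * p_spam
    return numerator / (numerator + p_word_given_ham * p_ham)

# Suppose "guaranteed" appears in 40% of known spam but only 1% of real
# mail. An email containing it is then very likely spam:
print(spam_probability(p_word_given_spam=0.40, p_word_given_ham=0.01))
```

The same logic, applied to word patterns in millions of leaked documents rather than emails, is what lets a newsroom separate the files worth reading from the noise.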
“[Miguel Fiandor called it] a sumo fight. Big data on one side, and on the other, all of us, the journalists, reporters, software developers, and editors,” Díaz-Struck said.
Eventually, machine learning helped ICIJ cull the data into more manageable groupings, and together with ICIJ technologies such as Datashare and other data analysis approaches, the team handled the big data. In parallel, more than 600 reporters from around the world took on the herculean effort of connecting the dots between reports of tax evasion and dubious financial dealings by hundreds of world leaders and billionaires.
Pointing fingers: Naming past misclassifications
Another popular use of machine learning is to name misclassifications. This was the tack taken in 2015, when Ben Poston, Joel Rubin and Anthony Pesce used machine learning for the Los Angeles Times to determine that the Los Angeles Police Department had misclassified approximately 14,000 serious assaults as minor offences over an eight-year period. The misclassification made the city’s crime levels appear lower than they actually were.
Similarly, BuzzFeed News’ investigation of secret surveillance aircraft used to hunt drug cartels in Mexico, by reporters Peter Aldhous and Karla Zabludovsky, was a question of classification. The effort, which Aldhous documented in a separate BuzzFeed News article and on GitHub, used a random forest algorithm, a well-known statistical model for classification, to identify potential surveillance aircraft.
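Aldhous’s actual analysis is on GitHub; purely for illustration, here is a minimal Python sketch of the technique using scikit-learn, with synthetic data and invented feature names standing in for real flight records:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-aircraft features (e.g. turning rate, speed,
# altitude); y = 1 marks known surveillance aircraft in the training set.
X_known = rng.random((200, 3))
y_known = (X_known[:, 0] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_known, y_known)

# Score unlabelled aircraft: probability each one is a surveillance plane,
# then rank the highest-probability candidates for human review.
X_unlabelled = rng.random((5, 3))
probs = model.predict_proba(X_unlabelled)[:, 1]
ranked = probs.argsort()[::-1]
```

In practice, a ranked list like this is a set of leads for reporters to verify, not a conclusion.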
And misclassification was vital in the ICIJ’s Implant Files. This expansive investigation found that medical devices implanted into people’s bodies--such as vaginal mesh, copper coil birth control, breast implants, heart monitors, hip replacements, and so on--were linked to more than 83,000 patient deaths and nearly 2 million injuries. Of the patients who died, 2,100, or 23% of these deaths, were not reported as deaths but were more vaguely classified as “device malfunctions or injuries.”
Checking the wrong box has grave consequences, including misleading health authorities about when devices are linked to deaths and preventing regulators from knowing when a product merits further review--to the detriment of future patients. Díaz-Struck explained that it took her team months to design and fact-check the machine learning for this research. In the methodology article, published in 2018, she explains that text mining, clustering, and classification algorithms were all involved.
They went on to use machine learning to make a second classification: identifying patient gender, a category not included in the patient files made available by the U.S. Food and Drug Administration. Many of those who had died or were harmed by implants were women, but not always from “women’s devices” such as breast implants.
What were the numbers? Partnering with researchers at Stanford University, Díaz-Struck’s team painstakingly trained a model to identify the gender of patients who had been harmed, or died, from an implanted medical device, using the presence of pronouns or mentions of gendered body parts in the notes section of the reports.
After six months of effort, the team of nearly a dozen was able to identify the sex of 23% of the patients, of whom 67% were women and 33% were men. A key limitation was the quality of the data, said Díaz-Struck. Nevertheless, the effects of this reporting include not only greater transparency but reforms.
By any other name? Predicting misclassifications
Sometimes applications of machine learning come out of informal conversations. That’s what happened with the Grist and Texas Observer story that predicts the number of unused oil and gas wells in the Permian Basin that will likely be abandoned in the coming years, at a cost to taxpayers of nearly a billion dollars. The story began with no talk of predictions; rather, it was an informal chat between Aldern and fellow Grist journalist Naveena Sadasivam.
“She’s been on the oil beat for ages and when the COVID-19 pandemic hit, the price of oil dropped. It even went briefly negative. When that happens, some of the mom-and-pop companies hit hard times. Would any go bankrupt, she wondered? And what would happen to the wells?” Aldern said. Sadasivam joined Texas Observer reporter Christopher Collins to find out.
Looking over data Sadasivam collected from public records requests, she and Aldern brainstormed “what we could say,” he recalled. They spent time organising it into a structured database, still unsure if there was a story. The dataset features included the production history of all the wells in Texas, plus macroeconomic indicators, employment, geotags, depth, drilling history for decades, and cost of oil over time.
“At one point we asked, could we use this to figure out the future? This was a classification problem: which wells might be abandoned in the next couple years,” Aldern said. “And it’s a perfect question for machine learning.”
The results were damning. The model predicted that 13,000 wells would be reclassified from inactive to abandoned in the next four years, costing taxpayers nearly one billion dollars--not to mention the environmental effects of abandonment. Sadasivam and Collins’s on-the-ground reporting corroborated these findings, based on interviews with experts and ranchers who worried, “no one is coming.”
Aldern documented the methodology in an article and shared the data and code in a GitHub repository. He was also featured in the Conversations with Data podcast earlier this year.
Holding technological black boxes accountable with machine learning
A subversive use of machine learning is holding privatised machine learning accountable. As the commercial rollout of AI has taken hold in the past decade--a rollout with implications for newsrooms as well, which we will document in the second piece in this series--tech companies remain tight-lipped about their processes, refusing to allow independent researchers to assess structured machine learning.
Meanwhile, algorithmic predictions have been criticised for reproducing inequalities, as Virginia Eubanks, a political science associate professor at the University at Albany, argues in her book "Automating Inequality"; or for incentivising--and bankrolling--disinformation campaigns, as Karen Hao reports.
The Markup, led by Julia Angwin, is a nonprofit newsroom focused exclusively on “watchdog” reporting about Big Tech. Like other newsrooms featured in this story, The Markup leverages machine learning and other data-driven methodologies to reverse engineer algorithms or identify misclassifications, then publishes a “show your work” article and releases the data and code.
Maddy Varner, an investigative journalist at The Markup, said in an email that they use machine learning for investigations, including a random forest model in their work on Amazon’s treatment of its own brands--a story that took a year to break and was also described in a letter from Angwin.
Transparency builds trust. “It is very important not just to say what you know but to explain why you know it,” said Aldhous, who explained that transparency is a cornerstone value at BuzzFeed News. “The greater the ability to see under the hood of the methods, the better. It is like, this is how it works. This is why we have that number. This is why we think that’s a spy plane.”
No need to reinvent the robot
If getting started sounds daunting, one of the benefits of data science is the open-source community, said Aldern. Data journalists share code and training data on GitHub, where other data journalists or data scientists can take a look.
Don’t be afraid to copy-paste. Borrow tried-and-true algorithms for logistic regression or decision trees. For data journalists who are new to machine learning, it’s possible to follow the work of others to learn.
But reporting won’t be fast. Lucia Walinchus, executive director of the non-profit newsroom Eye on Ohio and a Pulitzer Center grantee, has spent more than six months using machine learning to analyse public records on housing repossession in Ohio. The project seeks to understand, mathematically, what makes land banks repossess some homes that are behind on taxes, but not others.
“It’s the perfect problem for software,” she said, though machine learning is only part of the story and doesn’t replace on-the-ground investigative research. Her inaugural machine-learning investigation is slated for publication in the coming weeks.
Resource-strapped newsrooms can consider partnerships with academics or companies. The ICIJ has partnered with Stanford University and independent companies to address particularly gnarly data problems while maintaining journalistic independence--crucial when dealing with sensitive materials for a big story that hasn’t yet been broken.
The ICIJ doesn’t outsource the labelling of training data, to ensure accuracy, though they did use a machine learning tool called Snorkel to help classify text and images. Outsourcing the human work of labelling to platforms such as Amazon’s Mechanical Turk, which relies on workers who are paid pennies, has raised ethical concerns.
Data journalists can also be mindful of criticism about the costs of partnerships with tech companies, as Whittaker writes.
When independent journalists or academics need tech companies to access the computational power or intellectual resources to conduct research, those companies get to have the final say on decisions about what work to do, to promote, or discuss. “In the era of big data, journalists are not going to disappear, they are more essential than ever,” said Díaz-Struck.
Resources to master machine learning
To ramp up skills, there are free training programmes available. At the Associated Press, Aimee Rinehart is leading a new effort to expand local news organisations’ understanding and use of AI tools and technologies, funded with $750,000 from the Knight Foundation’s AI effort. News leaders in U.S. newsrooms can take a survey to inform the curriculum of an online course designed by the AP; the survey closes in early December 2021.
After running the course, AP will partner with five newsrooms to identify opportunities for AI and implement those strategies. This initiative follows on the heels of the London School of Economics Journalism AI project funded by Google News Initiative, which also offered a free course on AI for journalists.
Investigative Reporters and Editors (IRE) runs data journalism bootcamps that teach hands-on technical skills for turning data into accurate, interesting stories. These trainings are not free, but prices vary based on the size of the newsroom, with scholarships available, as well as discounts for students and freelancers. Programmes support journalists in sharpening basic to advanced skills in spreadsheets, data visualisation and mapping, SQL, and coding in R and Python. Journalists must be members of IRE to enrol.
Data journalists can bootstrap their own training programme by learning from and participating in machine learning competitions run by Kaggle, which hosts over 50,000 datasets. While not specifically designed for journalists, the competitions can be valuable and come with three-figure prizes (in U.S. dollars). A Google account is required.
How it works: Machine learning in a nutshell
Let's run through the machine learning process. The basic tasks include the following:
1. Assemble data. "It's an open secret in any data story that the majority of the work is getting the data into order," said Aldern. Data can be public, garnered from public records requests, scraped, or shared from an external source. Consider the questions you'd like to use the data to answer.
2. Identify labels and features of some data to build a statistical model. Criteria for identifying features might be drawn from the investigation. For instance, for the query on whether inactive oil wells in Texas and New Mexico were misclassified and could be soon abandoned, Aldern used state agency definitions of "orphaned" and "inactive" to label data. This intel was gleaned by Naveena Sadasivam and Christopher Collins, reporters on the oil and gas beat.
3. Test the model to avoid overfitting or bias. Models should make generalised, accurate predictions. One technique for checking a model's performance is to divide the labelled dataset in half: use the first half to train the model and the second to evaluate the accuracy of the trained model. Tweak the model based on the results of the test run on the second labelled dataset.
4. Analyse the unlabelled data. This step leverages the trained model to provide an answer to the question you are asking of the remaining data: Which inactive oil wells could be misclassified as orphaned? Which files are spam? Which device reports have been misclassified as not causing harm? The methodology often relies on processes derived from statistical modelling, such as linear regression, decision trees, logistic regression, or clustering, and is written in programming languages such as R, Python or Julia.
5. Corroborate results. Aldern does this by "trying to prove myself wrong." To check the machine learning results, data journalists interviewed for this piece will ask data scientists affiliated with universities to independently review the results. Best practice also includes writing a methods article (as Aldern did here), along with sharing links to GitHub repositories. Finally, boots-on-the-ground reporting will substantiate results.
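Put together, steps 2 through 4 can be sketched in a few lines of scikit-learn. Everything here is synthetic and invented for illustration--the features, the labelling rule, and the 50/50 split--but the shape of the workflow matches the steps above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

# Step 2: labelled data -- four invented features per record,
# label 1 standing in for, say, "likely orphaned well"
X = rng.random((400, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Step 3: hold out half the labelled data to check for overfitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

# Step 4: apply the trained model to records with no labels
X_unlabelled = rng.random((10, 4))
predictions = model.predict(X_unlabelled)
```

Step 5 has no code: the predictions are a tip sheet, and the corroboration happens in interviews, document checks, and independent review.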
Key words in machine learning
:LABEL. The label is the thing that will be predicted later, the dependent variable, the y variable in linear regression. A label could be a noun or an event. Spam. Spy Plane. Orphaned. Dead.
:FEATURES. The features are the attributes of a labelled item that the model uses to classify new, not-yet-labelled items; otherwise known as independent variables. For instance, a feature might be pointy ears. A style of email address. Turn direction. Ownership status. Smoking habits.
:CODE. The algorithm used to analyse the data. Often a version of a tried-and-true statistical model such as linear regression, decision trees, logistic regression, or clustering, written in programming languages such as R, Python or Julia.
Thanks to Caroline Sinders, a machine learning consultant, for being interviewed for background research for this piece.