When data writes…who is the author?

Approaches to harmonise automated and human journalism


If a robot wrote this article, you probably wouldn’t know it. The movement to produce automated content, or ‘robot journalism’, has grown substantially over the past couple of years. But editorial policies surrounding the attribution of this content have not moved at the same pace.

Typically, human-written stories are accompanied by a byline crediting the journalist who spent hours researching and writing a piece. Yet, when a data-driven algorithm is doing all the work, this attribution imperative is not as clear. So, how do news organisations gauge the best attribution regime for automated journalism? And why is it important to develop a sound crediting policy for this type of content?

To find out more about the prevailing authorship and crediting policies for automated content, Tal Montal and Zvi Reich from the Ben-Gurion University of the Negev conducted a content analysis of automated stories. They surveyed 12 websites and collected qualitative data from interviews with associated news outlets. We spoke to Tal about their research.

Your study examines transparency in automated journalism, particularly disclosure and algorithmic transparency. Can you provide a brief outline of these concepts, and tell us why journalists and audiences should care about them?

Transparency is a crucial feature of modern journalism and central to the ongoing debate over the necessity of openness towards the public. In the field of automated journalism, where the ‘black box’ of generative algorithms is being used throughout the different levels of news production, it is important to consider the potential implications of these algorithms in the development of crediting policies.

Disclosure transparency refers to the extent to which journalistic organisations and news producers communicate to readers the way their news stories are selected, processed, and produced. In our case, it is reflected in how readers are informed that a certain story was generated by an algorithm. This type of transparency, which has become more common with the rise of online journalism, is a must in the domain of automated journalism.

Algorithmic transparency is a more complex, expensive, and controversial type of transparency, in which the methodology of an algorithm -- including its limitations and the way it actually works, from input to output -- is disclosed to readers. It is more complex for several reasons: it is more time consuming and thus expensive, the methodology needs to be explained to readers who are unfamiliar with computer programming, and it is controversial due to trade secret and copyright issues. This last point is reflected in the fact that the algorithms' owners (either news organisations or software companies from that field) are, understandably, reluctant to reveal how their algorithms work.

Making these types of transparency a part of a crediting policy for automated content will diminish possible negative effects and help both journalists and readers understand the strengths and weaknesses of this way of content production.

You categorised your findings into four levels of transparency. What are they? How can we use them to assess the current state of disclosure and algorithmic transparency?

The four levels of transparency are based on a content analysis of automated stories from the sites for which we had solid information about their use of algorithms to generate stories. We distinguished sites, or different sections of the same site, that had a full disclosure note regarding the algorithmic nature of the specific story, the developer of the algorithm, or data sources, from those that didn't. The top tier is Full Transparency.

Besides a full disclosure note, all of the Full Transparency sites had a byline that credited either the news organisation (such as AP), the software company (such as Automated Insights or Narrative Science), the human reporter, or even the bot itself (like Quakebot in the LA Times).

The LA Times credits stories written by Quakebot using a clear byline.

Then, for sites which had only a byline and didn’t have a full disclosure note, we made further distinctions. Those with a byline crediting the software vendor or the algorithm itself, thereby implying but not explicitly stating the automated nature of the news piece, were labelled with Partial Transparency. Those with a byline crediting the organisation itself without mentioning a name of the writer, thus implying the uniqueness of this particular news story, were marked with Low Transparency. And those with no byline at all received No Transparency.

Only the Full Transparency sites actually employ both disclosure transparency, and -- to some extent -- algorithmic transparency. The Partial Transparency sites can only be considered to partly employ a disclosure transparency routine, although this is debatable. The Low Transparency, and of course the No Transparency sites, do not hold up to the standards of any transparency routine type.
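The four levels described above amount to a small decision rule, which can be sketched in code. This is only an illustrative reading of the scheme, not the researchers' own tooling: the names `Byline`, `Transparency`, and `transparency_level` are hypothetical, and the sketch does not cover a case the description leaves open (a human-reporter byline with no disclosure note).

```python
from enum import Enum
from typing import Optional


class Byline(Enum):
    """Who the byline credits (hypothetical categories for this sketch)."""
    SOFTWARE_VENDOR = "software vendor"
    ALGORITHM = "algorithm"
    ORGANISATION = "organisation"


class Transparency(Enum):
    FULL = "Full Transparency"
    PARTIAL = "Partial Transparency"
    LOW = "Low Transparency"
    NONE = "No Transparency"


def transparency_level(has_disclosure_note: bool,
                       byline: Optional[Byline]) -> Transparency:
    """Map a story's disclosure note and byline to one of the four levels.

    `byline=None` means the story carries no byline at all.
    """
    # A full disclosure note puts the site in the top tier.
    if has_disclosure_note:
        return Transparency.FULL
    # Byline only: crediting the vendor or the bot implies automation.
    if byline in (Byline.SOFTWARE_VENDOR, Byline.ALGORITHM):
        return Transparency.PARTIAL
    # Byline only: crediting the organisation hides the automation.
    if byline is Byline.ORGANISATION:
        return Transparency.LOW
    # No disclosure note and no byline.
    return Transparency.NONE
```

For example, a story with no disclosure note and a byline crediting the bot itself would be classified as Partial Transparency under this reading.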

What did your study reveal about news organisations' views towards the relationship between automated and public interest journalism?

This was a very interesting part of our report. It involved interviews with seniors and experts from a spectrum of roles including editors, managers, journalists, and developers, who came from varying fields of coverage like sports, weather, and finance, and covered different modes of automated content generation, such as in-house development or software vendors.

First, we discovered that there is a unanimous anthropomorphic perception regarding the author of automated content. All of the interviewees mentioned a single human author -- mainly the developer of the code -- or the organisation as a whole (as it is considered a collaborative process, and the organisation takes responsibility for the automated output). None of them regarded the algorithm itself as the author.

Second, when investigating their views regarding byline and crediting policies, most of them did not think that automated content requires a different or more adequate policy. Some of the respondents said it should not differ from the common human crediting policy, or any policy that prevails in their organisation. They did not mention a special need for adding a full disclosure note.

The third -- more interesting -- thing we discovered was that transparency is considered crucial in the eyes of these seniors and experts. Although they all regard the stories as having a human author, and don't believe there should be a special and unique crediting policy in the case of automated journalism, they believe that readers have the right to know that these stories are automated. This is their way of fully agreeing with the importance of disclosure transparency and, to an extent, algorithmic transparency. For instance, one of the respondents even spoke about the importance of publishing the stories' data sources.

We must admit that the interviewees mostly come from Full Transparency organisations, which have already grasped the importance of transparency routines. Their views represent organisations that accept and act in accordance with these routines. We noted inconsistencies between holding and practising such views, along with discrepancies between these views and scholarly literature. These highlighted the crucial need in the field of automated journalism for a new comprehensive and consistent byline and full disclosure policy.

What challenges and opportunities does automated journalism present for public interest journalism?

The opportunities are exciting for both readers and journalists.

Readers are now provided with news pieces and journalistic stories in niche coverage domains, such as high-school basketball leagues. They can obtain important and accurate data in almost real-time, for example earthquake reports and earnings previews. They also have the option to ‘drill down’ into granular data in broad and aggregated stories, such as the algorithms used in ProPublica’s schools project, which make it possible to filter and read a particular story about each and every school.

From the media’s perspective, automated journalism seems like a life-saver. It expands the possibilities of reaching ‘long-tail’ readers with no additional marginal cost. For reference: developing and running an algorithm for generating sports recaps costs almost the same for 1,000 recaps as for 10,000 recaps, except for the reload and processing times. It also facilitates the creation of collaborative pieces, with the algorithms providing data and the short textual pieces, while the human writers expand on them or use them in a wider context. It even provides a way of automatically generating formulaic news pieces, such as earnings previews. This frees human journalists to write other, broader, deeper stories that still require human journalistic routines. In addition, these generative algorithms can serve as a stopgap for understaffed media organisations.

Nevertheless, there are challenges that we cannot ignore.

We identified five major potential implications of using these algorithms in news and journalistic organisations:

  1. Practical implications, due to the fact that these algorithms are used in fields such as real-estate or security. Any mistake -- from incorrect data sources to a misleading sentence -- can affect the decision-making of readers in real life. This of course also applies to news stories generated by humans, but the quality assurance routine in relation to the following effects may differ.
  2. Psychological effects, since algorithms are perceived as more objective, accurate, and fair. This affects both readers and journalists, and therefore may influence their evaluation processes and practices.
  3. These algorithms have the (potential) ability to choose the required data or process it in a certain way, and ‘frame’ it by either selecting relevant data, or using certain speech patterns. This means that they can affect the visibility of socio-political actors, or maintain a predetermined agenda, when covering social and political issues. These ramifications may lead to complaints and even lawsuits against the news organisations which use these generative algorithms.
  4. Following on, there lies the fourth implication: vicarious liability towards readers -- not only legally, but also from an ethical perspective.
  5. The final implication of automated journalism technology is the occupational aspect. It may threaten the jobs, practices, and the autonomy of human journalists, but, conversely, it also has the potential to assist and bolster their work.

What is the best way to treat algorithmic authorship so that it aligns with the public interest?

Our study suggests a new, consistent, and comprehensive policy that distinguishes between an output that is fully generated by an algorithm (algorithmic content generation) and an output generated by an algorithm in collaboration with a human journalist (integrative content generation), while serving the public interest.

Our suggested attribution policy for algorithmic content generation is:

  • The byline should be attributed to the software vendor, or the programmer in the case of an individual in-house programmer.
  • The full disclosure should clearly state the algorithmic nature of the content, while describing the software vendor, or the programmer’s role in the organisation, and detail the data sources of the particular story and the algorithm methodology.

In the case of integrative content generation, our suggested policy is:

  • The byline should be attributed to the human journalist(s), as the representative of the collaborative work done with the algorithm, in accordance with the anthropomorphic characteristics of the modern journalistic credit.
  • The full disclosure should declare the objects created by an algorithm in the particular story (a chart, map, specific paragraph), along with the content’s algorithmic nature (describing the software vendor’s business domain, or the programmer’s role), data sources of the story, and the algorithm’s methodology.
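As a rough illustration, the two suggested policies can be expressed as a single rule that picks the byline and lists what the full disclosure should cover. This is a minimal sketch under stated assumptions: `Attribution` and `suggest_attribution` are hypothetical names, the presence of a human journalist is taken as the signal for integrative content generation, and the disclosure items are plain-text paraphrases of the bullets above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Attribution:
    byline: str
    disclosure: List[str]


def suggest_attribution(programming_entity: str,
                        human_journalists: Sequence[str] = (),
                        algorithmic_objects: Sequence[str] = ()) -> Attribution:
    """Return the suggested byline and disclosure items for a story.

    programming_entity: the software vendor or in-house programmer.
    human_journalists: non-empty only for integrative content generation.
    algorithmic_objects: parts of the story made by the algorithm
        (for example a chart, a map, a specific paragraph).
    """
    # Items that both policies ask the full disclosure to cover.
    disclosure = [
        "algorithmic nature of the content",
        f"description of {programming_entity} (vendor or programmer's role)",
        "data sources of the story",
        "algorithm methodology",
    ]
    if human_journalists:
        # Integrative: the human journalist(s) represent the joint work,
        # and the disclosure lists what the algorithm produced.
        byline = ", ".join(human_journalists)
        disclosure.insert(
            0, "objects created by the algorithm: " + ", ".join(algorithmic_objects)
        )
    else:
        # Algorithmic: the byline goes to the vendor or the programmer.
        byline = programming_entity
    return Attribution(byline, disclosure)
```

Calling `suggest_attribution("Example Vendor")` would credit the (made-up) vendor in the byline, while passing `human_journalists=["A. Reporter"]` would move the byline to the journalist and add the algorithm-made objects to the disclosure.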

We accept the notions put forward by most interviewees and scholars: the human author of automated content is the representative of collaborative work for integrative content generation, and the programming entity, such as the programmer or software vendor, is the representative for algorithmic content generation.

Our suggested policy is tailored, however, to the current level of technological development of robot journalism algorithms, which takes into account the current level of creativity, among other criteria. Any significant technological developments or legislative progress regarding computer-generated works may invite adjustments to this policy.

Read the full research article here.
