How to preserve data journalism

Name: DataJournalism.com

Exploring the possibilities for archiving and saving interactive content and data storytelling

18 July 2022

News organisations have longstanding practices for archiving and preserving their content. The emerging practice of data journalism has led to the creation of complex new outputs, including dynamic data visualisations that rely on distributed digital infrastructures.

Traditional news archiving does not yet have systems in place for preserving these outputs, which means that we risk losing this crucial part of reporting and news history.

Taking a systematic approach to studying the literature in this area, I worked with Kathryn Cassidy, Edie Davis and Natalie Harrower, experts in digital archiving and preservation, to study what the new types of content produced by data journalism mean for archiving and preservation. We also looked into potential solutions that we could borrow from more established disciplines, such as data and digital archiving and software and game preservation.

In a journal paper we published, we identified the challenges and sticking points in the preservation of dynamic interactive visualisations, and provided a set of recommendations for adopting long-term preservation of dynamic data visualisations as part of the news publication workflow. We also identified concrete actions that data journalists can take immediately to ensure that these visualisations are not lost. Here I take you through some of the problems we identified in our study and our recommendations for preventing further and permanent loss of content.

Traditional journalistic outputs were usually published in text and audiovisual format, with news organisations having a longstanding history of archiving and preserving these outputs on various media.

This included paper, tape, or hard disk drives, depending on the historical period and the original format of the output. Similarly, institutions such as national libraries and archives generally hold large and long-standing newspaper archives.

The enthusiastic uptake of data journalism in the past decade, however, has opened up a new set of challenges for preservation and demands new guidelines and practices. The output of data-driven journalism still includes traditional text and audiovisual formats, but it also includes data visualisations and/or news applications.

Many of these visual elements rely on digital infrastructures that are not being systematically preserved and sustained, as traditional news archiving has not accounted for these dynamic and interactive narratives.

These visualisations communicate key aspects of the story, and without them, in many cases the story is either incomplete, or entirely missing, and so is a part of history.

At the same time, an increasing number of such new, complex outputs are being generated in newsrooms across the world every day, and it is expected that this trend will continue to grow. Without intervention, we will lose a crucial part of reporting and news history.

Where is the problem coming from?

Data visualisations are one of the core outputs of data journalism. They can take the form of static image files (e.g. JPEG, GIF or PNG), but in many cases they are dynamically generated by computer code at the time of viewing.

For example, many interactive data visualisations these days are JavaScript-based, such as those made using the D3.js library, or with online and/or interactive data visualisation tools built on top of JavaScript libraries, such as Datawrapper, Flourish, Charticulator, Carto and Mapbox.

These data visualisations are hosted on web servers, possibly outside of the news organisation. If the code behind the visualisation breaks, the server goes offline, or the link between the publication website and the server hosting the visualisation breaks, then the visualisation disappears or renders an error.

We consider any visualisation beyond a simple image to be a dynamic data visualisation. As such, all interactive data visualisations are considered dynamic. Such dynamic content cannot be captured by existing tools and methods of archiving, such as tools for archiving web pages or images and videos, and is consequently being lost.

Dynamic data visualisations are essentially software, and their preservation therefore should include methods suited for software preservation.

My colleagues in the preservation domain consider these dynamic data visualisations as ‘complex digital objects’.

These are distinguished from ‘simple’ or ‘flat’ objects such as image and video files, as they are more challenging to maintain and preserve for long term and sustained access, because they rely on complex digital infrastructures that contain a series of technical (inter)dependencies, where each part of the infrastructure must function in order to deliver the final output.

Simple objects are more likely to be maintained long term, because they fall under existing preservation methods used within news organisations since the beginning of the 20th century.

In contrast, the many infrastructures that support ongoing access to dynamic visualisations are not being systematically sustained or preserved in a way that would ensure access to data journalism outputs.

In many cases, the organisation that creates the visualisation and holds an interest in its preservation (the news organisation) is not the organisation that holds the key to that visualisation’s sustainable accessibility.

Evolving technology threatens preservation of new forms of content in different ways. Here, I list the four primary factors that our research identified as endangering the preservation of data journalism outputs:

1. Third-party services:

Many data visualisations make use of third-party data visualisation tools, such as Datawrapper and Flourish, which provide useful and often sophisticated assistance in creating visualisations.

However, the use of these tools creates risk because of dependencies on the tool provider: the tool may not be maintained by the provider, changes made to its underlying technologies may ‘break’ the connection to the published visualisation on a news site, or the service might disappear altogether.

This has already come to pass with the shutdown of Silk.co and Google Fusion Tables, both data visualisation services once popular with data journalists.

In the case of Silk.co, the website closed at short notice, cutting off access to any data visualisations that had not been exported or migrated by their creators prior to the shutdown.

A similar scenario happened a year later, in December 2018, when Google announced that it would retire its Fusion Tables service.

Fusion Tables was one of the tools behind many early examples of data journalism, such as the WikiLeaks Iraq war logs or the UK riots in 2011, both published by the Guardian.

FIGURE 1. Screenshots from the Guardian story, depicting how the content gets lost when the third-party services are not maintained: www.theguardian.com/news/datablog/2010/oct/23/wikileaks-iraq-data-journalism. Screenshot taken on 5th August 2020.

FIGURE 2. Screenshot from October 2021 of The Guardian story, depicting how the content gets lost when the third-party services are not maintained.

Both stories were early exemplars of data journalism as we now know it, and have featured in many talks, tutorials and introductions to data journalism, including Simon Rogers’ TEDx talk ‘Data-journalists are the new punks’. I still play the video of his talk in my classes, but none of the maps, which are the core of these stories, are there any more.

Google Fusion Tables was switched off at the end of 2019, and much of the associated content disappeared. The Guardian examples mentioned are only two of many stories across news organisations that have lost their visualisations in recent years.

2. In-house tools:

While many workflows rely on third-party apps, some organisations have also designed in-house tools.

These may afford greater control over the tool and its integration with internal technologies, but often these tools have been designed for specific purposes, such as to communicate the data behind a given data-driven piece.

The longer-term use of the tool or its maintenance may not have been considered during the design process, or no strategy may have been put in place to track, archive and preserve the output of such tools.

Additionally, these tools are often developed by a small number of interested news nerds in the organisation (sometimes just one), who may not stay in the same organisation for long, and the continued use or maintenance of the tool may vanish completely with the departure of the individual(s) involved.

3. Content Management Systems:

The public-facing website of a news organisation is usually fed by a backend Content Management System (CMS), which itself is regularly maintained, updated, and periodically replaced by new platforms.

Through these changes, the embedding functionality that connects the visualisation to the CMS can be broken or rendered incompatible. In this case, the visualisation and/or the tool remain intact, but the visualisation is not fetched or displayed properly on the news organisation website.

For example, iFrames have been one of the common ways to embed data visualisations created with external online tools into stories. An iFrame essentially creates an opening on an HTML page, which can pull in content from external websites, including visualisations created in a range of external tools, such as Datawrapper and Flourish, or the Google Fusion Tables maps in the Guardian stories above.

Most online data visualisation tools provide iFrame embed codes, which the journalist can simply copy and paste to their organisational CMS.

The smallest change in the iFrame or embed code management in the CMS could break this link. In such a case, the content remains hosted externally, but the content will not be shown on the publisher website.
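To make this dependency concrete, the following is a minimal sketch of how a newsroom might list the externally hosted embeds in a story, so that it knows which visualisations rely on someone else's server. It uses only Python's standard library. The function name `external_embeds`, the class `EmbedFinder`, the sample markup and the host names are all hypothetical and are not taken from any particular CMS.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class EmbedFinder(HTMLParser):
    """Collect the src of every <iframe> and <script> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("iframe", "script"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def external_embeds(story_html, publisher_host):
    """Return embed URLs served from a host other than the publisher's."""
    finder = EmbedFinder()
    finder.feed(story_html)
    external = []
    for src in finder.sources:
        host = urlparse(src).netloc
        # Relative URLs have no host and are served by the publisher itself.
        if host and host != publisher_host:
            external.append(src)
    return external


# Hypothetical story markup; the hosts below are placeholders.
story = """
<p>Our map of the data:</p>
<iframe src="https://charts.example-tool.com/embed/abc123"></iframe>
<script src="/static/site.js"></script>
"""
found = external_embeds(story, "news.example.org")
print(found)
```

A list like this does not preserve anything by itself, but it shows at a glance which parts of a story would vanish if the external host or the embed mechanism changed.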

4. Myriad of other technologies:

While the above risks point to significant changes in known aspects of the technology chain, there are other dependencies that underpin visualisations, such as particular programming languages, libraries, databases, hosting platforms and tools.

These change over time – by the news organisation, the tool provider, or globally – and changes can cause the data visualisation itself to no longer be accessible or viewable.

An example of technological change can be seen in the consequences of Adobe’s decision to retire Flash. In countless stories published around and before 2010, such as The Guardian’s articles on Earthquakes, or The Financial Times’ Banks’ Earnings, the visualisation itself was the article.

So their disappearance due to the deprecation of Flash resulted in empty pages, with the now-useless suggestions to download or update Flash Player as shown in the images below.

FIGURE 3.1. Screenshot taken in February 2021 from The Guardian story, depicting the disappearance of the full story due to the deprecation of Flash.

FIGURE 3.2. A screenshot taken in February 2021 from The Financial Times.

A 2010 paper by Edward Segel and Jeffrey Heer studied 58 visual stories from several publishing houses in their research on narrative visualisation.

Unrelated to their findings, I note that most of the visualisations they studied are no longer accessible. It happens that at the time of their research, Flash was the go-to technology for creating interactive visualisations.

Just 10 years after this study, Flash Player was deprecated and consequently very few of the visualisations remain accessible. Flash will not be the only casualty, as preferred apps and scripts continue to change over time.

In addition to these large-scale failures, all digital objects, simple or complex, are in danger of degradation or loss over time, due to factors such as data corruption (bit rot), the obsolescence of file formats, software and hardware, and the limited lifespan of storage media.
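One common way to detect this kind of silent corruption is a fixity check: record a checksum for each archived file, then recompute and compare it later. The sketch below assumes SHA-256 checksums and a simple in-memory manifest. The function names and the sample file are hypothetical illustrations, not part of the study.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file under folder (by relative path) to its checksum."""
    folder = Path(folder)
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder, manifest):
    """List files whose current checksum differs from the manifest,
    including files that have gone missing."""
    current = build_manifest(folder)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )


# Demonstration with a throwaway folder and a made-up snapshot file.
with tempfile.TemporaryDirectory() as archive:
    snapshot = Path(archive) / "riots-map.png"
    snapshot.write_bytes(b"original image bytes")
    manifest = build_manifest(archive)

    # Simulate corruption by overwriting the file after the manifest is made.
    snapshot.write_bytes(b"corrupted image bytes")
    damaged = changed_files(archive, manifest)

print(damaged)
```

In practice the manifest would be stored separately from the files it describes, so that a single storage failure cannot damage both.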

For all of these reasons, it is imperative that news media prioritise digital preservation.

A screenshot of a message from Adobe explaining that support for Flash Player ended in December 2020.

How to tackle these problems

Our study identified several obstacles, ranging from specific technical challenges to broader social and organisational issues. You can read about them in detail in the paper.

But in short, two broad approaches emerged from the preservation methods:

1) Preservation of visualisations in their original working form

This approach entails keeping a working version of the visualisation available through methods such as emulation, migration, and virtual machines.

One important category to emerge from the literature on this approach is the discussion of specific preservation tools. These include ReproZip, which is primarily aimed at reproducible scientific research and provides functionality for collecting the code, data and server environment used in computational science experiments.

Other well-developed tools exist to capture entire webpages or websites. Examples are WebRecorder and the International Internet Preservation Consortium (IIPC) Toolset, comprising the Web Curator Tool and the well-known, open-source Wayback Machine.

In the Data Journalism Handbook 2, Meredith Broussard proposes that ReproZip could be used in conjunction with a web archiving and emulation tool for preserving news apps (see also Broussard & Boss, 2018).

While web archiving tools may provide a useful starting point for preserving dynamic data visualisations, they are not always able to capture highly interactive data visualisations, or those that rely on server-side applications and data, such as visualisations embedded via iFrame or other embedding features.

This is because the code actually sits somewhere outside of the current webpage. Furthermore, capturing the web through this method (which creates Web ARChive, or WARC, files) is difficult and complex, and is not likely to be implemented as part of journalistic workflows.

Additionally, there are a variety of other preservation, workflow management and configuration management tools (Chirigati et al., 2016; Steeves et al., 2017).

The existing approaches to keeping a working version of the visualisation in its original form, through web and software archiving, emulation, migration and virtual machines, are not specifically aimed at archiving dynamic data visualisations. They have mixed results when capturing interactive content, and are complex and expensive to implement and maintain. Even so, they could shed light on the tools necessary for archiving data visualisations.

They could also provide valuable directions for future preservation of dynamic and interactive data visualisations in data journalism.

2) Flattening the visual

The second approach attempts to capture a “flat” or simplified version of the visualisation via methods such as snapshots, documentation, and metadata.

A flat or simplified version, considered in digital preservation language under the category of ‘surrogates’, essentially turns dynamic visualisations from complex digital objects into simple objects, such as images, GIF animations or videos, which are more easily preserved.

The dynamism is not maintained, but an effort is made to capture a sense of the original visualisation to preserve at least some part of it from total loss.

How to choose? Significant Properties

In choosing which of these approaches is most suitable for a given dynamic data visualisation or story, our paper draws on the concept of the ‘Significant Properties’ of digital objects, originally proposed by Margaret Hedstrom and Christopher A. Lee in 2002. They developed the concept in response to the archiving of digital items in relation to their original physical objects, such as a physical book being archived in digital format (on microfilm!), and to cases where digital objects were converted from one format to another.

The idea was that the digitised version of a book, for example, may not be capable of preserving all of the properties of the original hard copy materials, such as accurate colour representation or the exact physical dimensions of the originals.

Significant Properties are those properties of digital objects that affect their quality, usability, rendering, and behaviour. These are typically technical or behavioural characteristics of the digital objects, which need to remain unchanged when the file is accessed in the future, in order for the file to fulfil its original purpose.

In the case of image files this might include aspects such as the height, width and colour depth of the image, while for video content it could include aspects such as the playback length and frame rate.

Software and other interactive digital objects tend to have more complicated significant properties relating to their behaviour and the types of possible user interaction.

Computer games, for example, are inherently experiential: the experience of the game is a significant property of the application. This can also be the case with data visualisations. The user interaction and experience may be key to the meaning and value of a given data visualisation.

At the other end of the spectrum, interactivity may not add vital value to the visualisation; rather, the information conveyed through the different interactive elements may be considered the significant property. Or it could be somewhere in the middle.

The first step, therefore, is to identify these significant properties, weigh them against the time and resources available, and choose our preservation methods accordingly.

Where interactivity is a Significant Property, an approach using techniques such as emulation or migration may be indicated, as this preserves a working version of the original visualisation, and is thus more likely to preserve all of the significant properties of the object.

This approach would be in line with recent recommendations by the Digital Preservation Coalition on preserving software.

On the other hand, for some interactive data visualisations, dynamism and interactivity are not significant properties of the object, and much of the message is communicated without these aspects.

In such cases, it may not be necessary to preserve the entire interactive data visualisation in a working form, as an approach using snapshots and documentation as surrogates for the original may satisfactorily retain the significant properties.

Identifying whether and to what extent these are significant properties of a visualisation can help in selecting which approach to take in its preservation. But these must be considered alongside other resource and workflow requirements and limitations for preservation.
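The reasoning in this section can be summarised as a small decision sketch. It is an illustration only: the class `Visualisation`, its two yes/no fields and the wording of the three outcomes are my own simplification, and the middle case (interactivity matters but resources are lacking) is an assumption about how a newsroom might compromise, not a rule from the paper.

```python
from dataclasses import dataclass

# Labels for the possible outcomes (hypothetical wording).
WORKING_VERSION = "working version (emulation or migration)"
VIDEO_SURROGATE = "surrogate for now (video of the visualisation in use)"
FLAT_SURROGATE = "surrogate (snapshots and documentation)"


@dataclass
class Visualisation:
    title: str
    # Is the user interaction key to the meaning of the piece?
    interactivity_is_significant: bool
    # Does the newsroom have the time and money to keep a working copy?
    resources_for_working_copy: bool


def choose_approach(vis):
    """Pick a preservation approach from the significant properties
    and the resources available."""
    if vis.interactivity_is_significant and vis.resources_for_working_copy:
        return WORKING_VERSION
    if vis.interactivity_is_significant:
        # Assumed compromise: capture the interaction as a recording
        # until a working copy becomes feasible.
        return VIDEO_SURROGATE
    return FLAT_SURROGATE


examples = [
    Visualisation("Explorable election map", True, True),
    Visualisation("Scrollytelling budget story", True, False),
    Visualisation("Bar chart with tooltips", False, False),
]
for vis in examples:
    print(f"{vis.title}: {choose_approach(vis)}")
```

Real decisions will rarely be this binary, but writing the properties down explicitly, even in this crude form, makes the choice easier to justify and to revisit.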

Non-technical challenges

Regardless of the technical approach taken to preservation, several systemic methods could be drawn on from recognised topics in the digital preservation domain. Overall, our research indicates that the complexity of the task of preservation is the biggest obstacle to preserving these objects. This complexity is not limited to technical aspects.

Rather, it is in part attributable to the wider cultural or organisational challenge of digital preservation, where resources, both financial and human, are limited, preservation is not embedded in publication workflows, and advocates for preservation are few and far between.

Furthermore, the responsibility for these actions must be identified and pursued systematically. Awareness-building around preservation, guidelines for preserving visualisations, and training on how to integrate preservation into workflows can assist with these larger social or organisational challenges.

Recommendations for going forward

Recommendations for Immediate and Practical Interventions by Data Journalists

Here I start with a set of immediate and simple actions that could be taken by data journalists to ensure partial preservation of the content they are producing now, in lieu of more robust approaches to be developed and implemented widely in the future.

These approaches assume limited time and resources, and combine a basic identification of significant properties with the creation of surrogate outputs of types that are easily preservable using current technologies, such as images and audiovisual formats.

This is essentially a basic form of the snapshot method identified in the literature.

We propose that for every dynamic data visualisation included in a story, the journalist should:

Identify the significant properties of the data visualisation, in terms of their importance to the story at hand.
If an image screenshot of the data visualisation could represent these properties to a satisfactory degree, take a screenshot of the visualisation and store it with other archived audiovisual content.

Screenshots have been used by some news organisations in their archiving practices. Figures 4.1 and 4.2 depict two examples, from The Washington Post and The New York Times, where the story is missing because of the deprecation of Flash, but the organisations offer access to alternate archived content.

FIGURE 4.1. Screenshots taken in February 2021 from The Washington Post depicting the disappearance of the stories due to the deprecation of Flash, as well as the accessible screenshots through their archives.

FIGURE 4.2. Screenshots taken in February 2021 from The New York Times, showing the inability to read a story thanks to its reliance on Flash.

Following the link in The Washington Post story retrieves a PDF, which had been previously generated for the print version of the story.

Clearly, this conveys an acceptable degree of the original story’s intention. However, the link in The New York Times story retrieves a screenshot that only shows the first slide of a multi-slide story, which means a significant part is missing.

If an image screen grab cannot capture the story to a satisfactory level, then we propose two alternatives:

a. If a small number of screen grabs can tell the story, then create a GIF animation that includes these in sequence, and archive as above. GIF animations allow limited animation but are nonetheless relatively simple image files which are straightforward to preserve.

Many news organisations already create animated GIFs for content promotion on social media and so the tools and expertise are readily available.

The Economist data desk, for example, gave a workshop titled ‘From interactive to social media: how to promote data journalism’ at the 2018 edition of the European Data & Computational Journalism Conference, showing how they create GIF animations to promote their interactive data visualisations on social media.

These GIF animations, in essence, capture some part of the significant properties of the original interactive data visualisation.

b. If a number of screen grabs in the form of a GIF animation cannot do justice to the visualisation, then consider creating a video cast of the data visualisation in use, highlighting the most important parts. A range of widely-available free tools can be used to create such video content which is also relatively simple to preserve.

These simple surrogate representations must also be linked to the original story to ensure that the reader can find them if the story remains available, but the original visualisation is no longer available.

This linking could be via a structural solution whereby the CMS of the news organisation allows an alternate link to be specified and automatically displays the file behind the link if the main visualisation fails to load.

An alternative or possible interim solution would be to include a link under each visualisation to the surrogate version which invites the user to click on it if the visualisation does not display correctly. An example of how this has worked in practice could be seen in Figure

Creating an image, GIF animation or a video of your data visualisation is an uncomplicated solution that enables the capture of significant properties in terms of content as story, providing a stop-gap until more systematic and sophisticated methods for preservation of dynamic data visualisations are in place. In addition to long-term preservation and access, this simple method could also cater for issues associated with loading complex objects across devices.
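The interim linking solution described above, a visible link under each visualisation pointing to its surrogate, could be produced by a small helper at publication time. This is a minimal sketch under stated assumptions: the function name `embed_with_surrogate`, the markup structure, the CSS class and the URLs are all hypothetical, and a real CMS would have its own templating for this.

```python
from html import escape


def embed_with_surrogate(embed_url, surrogate_url, title):
    """Build embed markup with a visible link to the archived surrogate.

    All values are HTML-escaped so that characters such as & or quotes
    in a URL or title cannot break the surrounding markup.
    """
    return (
        '<figure class="dataviz">\n'
        f'  <iframe src="{escape(embed_url)}" title="{escape(title)}"></iframe>\n'
        "  <figcaption>\n"
        "    Visualisation not displaying correctly? "
        f'<a href="{escape(surrogate_url)}">View the archived version</a>.\n'
        "  </figcaption>\n"
        "</figure>"
    )


# Hypothetical embed and surrogate locations.
markup = embed_with_surrogate(
    "https://charts.example-tool.com/embed/abc123?a=1&b=2",
    "/archive/2011/riots-map.gif",
    "Map of incidents",
)
print(markup)
```

Because the link is plain HTML stored with the story, it keeps working even if the external visualisation host, or the embed mechanism itself, stops functioning.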

As such, we also propose that every provider of data visualisation creation tools should ideally provide GIF animation and video exports, in addition to their current visualisation exports.

Many data visualisation providers promise their users that in the case of company closure, users will be given the option to download the code behind the charts.

This is a responsible offer, but most journalists will not have the time or skills to execute that code on a different platform. Nor will they be able to go back to every single story they created to update the server information for where the data visualisation is hosted.

Hence, it is advisable that journalists create simple exports of their data visualisations at time of publication, and provide the information for how these can be accessed if the original publication fails.
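One lightweight way to "provide the information" is to save a small metadata record alongside the exported files at publication time. The sketch below writes such a record as JSON. The field names, the class `SurrogateRecord` and all the sample values are assumptions for illustration and do not follow any particular metadata standard.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class SurrogateRecord:
    """A sidecar record describing a visualisation and its surrogates."""
    story_url: str
    original_embed_url: str
    tool: str
    captured_on: str  # ISO 8601 date, e.g. "2022-07-18"
    surrogate_files: list = field(default_factory=list)
    significant_properties: list = field(default_factory=list)


# Hypothetical values for a single story.
record = SurrogateRecord(
    story_url="https://news.example.org/2022/07/flood-risk",
    original_embed_url="https://charts.example-tool.com/embed/abc123",
    tool="Example chart tool",
    captured_on="2022-07-18",
    surrogate_files=["flood-risk-map.png", "flood-risk-map.gif"],
    significant_properties=["regional comparison", "change over time"],
)

# Serialise with sorted keys so that records are easy to compare over time.
sidecar = json.dumps(asdict(record), indent=2, sort_keys=True)
print(sidecar)
```

A record like this costs a few minutes per story, and it gives an archivist, or a future colleague, enough context to find the surrogate and understand what it was meant to capture.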

Both data journalists and the wider digital preservation community should advocate with vendors of these tools to help bring this about.

These immediate and relatively contained measures could ensure that much of the data journalism currently being produced is not lost entirely, while the newsrooms find ways to implement the recommendations to ensure longer term systematic preservation of such complex objects.

In addition to these, in the paper my colleagues and I provide a set of recommendations for systematic, longer-term interventions. These recommendations draw on a systematic study of the literature in relevant areas such as web archiving, digital preservation, and software and game archiving, on methods detailed in professional literature from the fields of data journalism and digital preservation, and on our professional expertise as academics and practitioners in these areas.

Our recommendations for systematic, organisation and discipline based interventions fall into several categories, including guidance and education, infrastructure and tools, collaboration with trusted, local and national digital repositories and memory institutions, funding and resourcing, and legal frameworks.

These recommendations for long term and systematic interventions address the need for an organised and sustainable approach to the long-term digital preservation of data visualisations.

They aim to ensure that these increasingly important elements of journalistic output are routinely preserved alongside simpler forms of digital news media.

These medium to long-term actions require changes to workflows and investment into new policies, practices and technical solutions. As such, they require an investment of significant effort over time, financial resources, and collaborations that may expand the remit of existing institutions.

If you are interested in reading more about these, see the Recommendations part of the paper.

I would like to note here that the scope of this article, and the research paper underlying it, includes works relating to the preservation of dynamic data visualisation and associated software code and dynamic digital objects.

The preservation of the datasets that underlie data visualisations is also key; in some cases, they are required to make the visualisation function as it is rendered. In any case, the data should be persistently accessible to verify the findings communicated by the visualisation. However, this is a separate, larger issue for digital preservation and is out of the scope of this article.

As a pointer and food for thought, the preservation of research data is being studied by international initiatives such as the Research Data Alliance and the CODATA committee of the International Science Council, which could provide valuable input into the preservation of data and code when it comes to data journalism.

Digital preservation is an ongoing process, not simply an endpoint. Methods must evolve within and by the communities that are most invested in the long-term stewardship of their outputs.

Because of journalism’s fundamental and unique contribution to the historical record, it is imperative that preservation is built into the production of data journalism, so that this key element of the record is not lost.

Author Biography

Bahareh Heravi is a Data and Computational Journalism researcher, trainer, practitioner and innovator. She is currently a Reader in AI and Media at the Institute for People-Centred AI at the University of Surrey in the UK. Bahareh is a member of the Irish Open Data Governance Board, and a co-founder and co-chair of the European Data & Computational Journalism Conference. She was previously an Assistant Professor at the School of Information and Communication Studies at UCD, where she led the Data Journalism programme.

This article is a shortened and adapted version of an academic paper that Bahareh Heravi co-authored with her colleagues Kathryn Cassidy and Natalie Harrower from the Digital Repository of Ireland, and Edie Davis from the Library of Trinity College Dublin.

For the full journal paper, and also for citation and referencing please visit the journal website.
