Data journalism and the ethics of publishing Twitter data

But it's already public, right?

16 February 2018

By Matthew L. Williams

This article was migrated from datadrivenjournalism.net. It has been edited for DataJournalism.com, but some information may still be outdated.

Collecting and publishing data from social media sites like Twitter are everyday practices for data journalists. Findings in 2017 from Cardiff University’s Social Data Science Lab question the practice of publishing Twitter content without seeking informed consent from users beforehand. Researchers found that tweets collected on certain topics, such as terrorism, political votes, changes in the law, and health problems, create datasets that might contain sensitive content. These tweets may evince extreme political opinion, grossly offensive comments, overly personal revelations, and threats to life, whether the user’s own or that of others. Reporting and handling these data in the process of analysis -- such as classifying content as hateful and potentially illegal -- have brought the ethics of using social media in social research and journalism into sharp focus.

Ethics is an increasingly salient issue in research and journalism that use social media data. The digital revolution has outpaced parallel developments in research governance and agreed good practice. Codes of ethical conduct that were written in the mid-20th century are being relied upon to guide the collection, analysis, and representation of digital data in the 21st century. Social media data are particularly challenging ethically because they are so openly available -- especially from Twitter.

The terms of service on many platforms specifically state that users’ public data will be made available to third parties, and users legally consent by accepting these terms.

However, researchers and data journalists must interpret and engage with these commercially motivated terms of service through a more reflexive lens. They must adopt a context-sensitive approach, rather than focusing solely on the legally permissible uses of these data.

To publish or not to publish?

The Twitter APIs provide three levels of data access: a free random 1% sample that provides ≈5M tweets daily, and random 10% and 100% samples that are chargeable, or free to academic researchers upon request. Datasets on social interactions of this scale, speed, and ease of access have been hitherto unrealisable in the social sciences and journalism. Their availability has led to a flood of journal articles and news pieces, many of which display tweets with full text content and author identity without informed consent. This is presumably because of Twitter’s ‘open’ nature, which encourages the assumption that ‘these are public data’ and that using them does not require the rigour and scrutiny of ethical oversight. But even when these data are scrutinised, journalists tend to rely on the ‘public data’ argument, due to the lack of a framework to evaluate the potential harms to users.
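
To give a rough sense of scale, the sampling figures above can be extrapolated. The sketch below assumes that the free stream is an exact 1% random sample and that the ≈5M daily figure holds; both are simplifications, and the constant and function names are illustrative only.

```python
# Back-of-envelope estimate of daily tweet volume at each access level.
# Assumption: the free stream is exactly 1% of all tweets and yields about 5M tweets per day.
FREE_SAMPLE_PERCENT = 1
FREE_SAMPLE_DAILY = 5_000_000


def estimated_daily_volume(sample_percent: int) -> int:
    """Scale the observed 1% volume to another sampling percentage."""
    return FREE_SAMPLE_DAILY * sample_percent // FREE_SAMPLE_PERCENT


for percent in (1, 10, 100):
    volume = estimated_daily_volume(percent)
    print(f"{percent}% sample: about {volume:,} tweets per day")
```

On these assumptions, the full stream would carry several hundred million tweets a day, which is why manual consent for every collected tweet is impractical even where consent for quoted tweets is feasible.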

What the research says

The Social Data Science Lab takes a more ethically reflexive approach to the use of social media data in social research. Our 2017 Lab survey into users’ perceptions of the use of their social media posts carefully considered online context, and the role of algorithms in estimating potentially sensitive user characteristics.

Our survey of users found the following:

94% were aware that social media companies had Terms of Service
65% had read the Terms of Service in whole or in part
76% knew that when accepting Terms of Service they were giving permission for some of their information to be accessed by third parties
80% agreed that if their social media information is used in a publication they would expect to be asked for consent
90% agreed that if their tweets were used without their consent they should be anonymised

These survey findings point to a potential disjuncture between the current practices of social researchers and data journalists in publishing Twitter posts and users’ perceptions of their rights as data subjects and of the fair use of their online communications in publications. Much of this disconnection stems from differing views of what is public in online communications and, therefore, of what data can be published without consent or anonymisation.

Existing ethical guidelines that provide principles for research in public places focus on traditional forms of data and data collection. Most guidelines stress that consent, confidentiality, and anonymity are often not required when the research is conducted in a public place where people would reasonably expect to be observed by strangers.

However, the perceptions of the majority of Twitter users clearly differ from this viewpoint. This is most likely because Twitter blurs the boundary between public and private spaces.

A social media researcher must take into account the unique nature of this online public environment.

Internet interactions are shaped by ephemerality, anonymity, a reduction in social cues, and the realisation of time-space distanciation, leading individuals to reveal more about themselves within online environments than they would in offline settings. This further blurs the line between the public and the private.

Research has highlighted the disinhibiting effect of computer-mediated communication, meaning that internet users -- while acknowledging the environment as a semi-public space -- often use it to engage in what would normally be considered private talk. Online information is often only intended for a specific networked public made up of peers, a support network, or specific community. It is not necessarily for the internet public at large, and certainly not for publics beyond the internet. When information flows out of the context it was intended for, it is viewed by unintended audiences and has the potential to cause harm. Academic and regulatory delineations of the public-private divide may not hold in online contexts, and, as such, privacy is a concept that must include a consideration of expectations and consensus within context.

Informed consent and anonymity are further warranted given the abundance of sensitive data that are generated and contained within these online networks.

My 2017 study shows associations between users’ sexual orientation, ethnicity, and gender and their feelings of concern and expectations of anonymity. A principal ethical consideration is to ensure the maximum insight from data journalism whilst minimising the risk of actual or potential harm during data collection, analysis, and publication.

The potential for harm increases when sensitive data are estimated. These data can include personal demographic information like ethnicity and sexual orientation, information on associations (for example, membership of particular groups, or links to other individuals known to belong to such groups), and communications of an overly personal or harmful nature (such as details of morally ambiguous or illegal activity, and expressions of extreme opinion). In some cases, this information is knowingly placed online, whether or not the user is fully aware of who has access to it and how it might be repurposed. In other instances, sensitive information is not knowingly created by users, but it can come to light in analysis where associations are identified between users and personal characteristics are estimated by algorithms.

In order to balance the privacy of Twitter users -- taking into account the disinhibiting nature of the environment, and the abundance of sensitive information generated -- with the needs of data journalists, we should collect data without explicit consent but seek informed consent for all content quoted directly in publications. The alternative solution -- that is, providing anonymity to directly quoted users -- is simply not practical in this form of research. This is due to Twitter’s guidelines and to online search: quoted text is easily searchable, which renders users and their conversation partners identifiable.

In the case of the reproduction of tweets -- that is, the public display of tweets by any and all means of media -- Twitter’s 2016 Broadcast guidelines stated that publishers should:

Include the user’s name and Twitter handle (@username) with each Tweet.
Use the full text of the Tweet -- editing Tweet text is only permitted for technical or medium limitations (for example, removing hyperlinks).
Not delete, obscure, or alter the identification of the user, except in exceptional cases such as where there are concerns over user privacy.
In some cases, seek permission from the content creator, as Twitter users retain rights to the content they post.

For data journalists to abide by these guidelines, informed consent should be sought from each tweeter before their post is quoted directly in research outputs, given that anonymisation is not advised. This is particularly important considering Twitter’s view that users retain rights to the content they post.
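
One way to make this rule hard to skip in a newsroom workflow is to refuse to format a verbatim quote unless consent has been recorded. The following is a minimal sketch, not an official Twitter or Lab tool: the `QuotedTweet` class, its field names, and the `consent_obtained` flag are all hypothetical, and the flag is assumed to be set by the journalist after contacting the user.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotedTweet:
    # All field names are hypothetical, chosen for this sketch.
    name: str               # display name of the account
    handle: str             # Twitter handle without the leading "@"
    text: str               # full, unedited tweet text
    consent_obtained: bool  # recorded by the journalist, not supplied by Twitter


def format_for_publication(tweet: QuotedTweet) -> str:
    """Return an attributed verbatim quote, refusing when consent is missing."""
    if not tweet.consent_obtained:
        raise PermissionError(f"No informed consent recorded for @{tweet.handle}")
    # Name, handle, and the full text are kept together and unaltered.
    return f'{tweet.name} (@{tweet.handle}): "{tweet.text}"'


example = QuotedTweet("Example User", "example_user", "An example tweet.", True)
print(format_for_publication(example))
```

Raising an error rather than silently anonymising reflects the point made above: an anonymised verbatim quote can still be traced through search, so the safer default is not to publish it at all.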

The issue of deletion and the ‘right to be forgotten’ further buttress the need for consent to quote directly. Twitter’s 2015 Terms of Service for the use of their APIs by developers required that data harvesters honour any future changes to user content, including deletion.
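
In practice, honouring deletions means periodically removing deleted tweets from any stored dataset. The sketch below assumes the stored tweets are dictionaries with an `"id"` key and that the IDs of deleted tweets are already known; how those IDs are obtained is not covered in this article, and the function name is illustrative.

```python
def purge_deleted(stored_tweets, deleted_ids):
    """Return only the stored tweets whose IDs have not been reported as deleted."""
    deleted = set(deleted_ids)  # set lookup keeps the filter fast on large archives
    return [tweet for tweet in stored_tweets if tweet["id"] not in deleted]


# Hypothetical archive and deletion notice, for illustration only.
archive = [
    {"id": "101", "text": "first example"},
    {"id": "102", "text": "second example"},
    {"id": "103", "text": "third example"},
]
archive = purge_deleted(archive, deleted_ids=["102"])
print([tweet["id"] for tweet in archive])
```

A stored dataset can be cleaned this way, but a tweet already quoted in a published article cannot be recalled as easily, which is one more reason to seek consent before quoting.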

However, data journalists should not conclude that conventional representation of social media content is precluded. As in conventional journalism, journalists can make efforts to gain informed consent from a limited number of posters if verbatim examples of text are required.

In a similar vein, we propose that researchers conduct a risk assessment before publishing tweets in research outputs. The decision flow chart below is designed to assist researchers and data journalists in reaching a decision on whether or not to publish a tweet, and in what contexts informed consent -- either opt-in or opt-out -- may be required.

[Figure: decision flow chart for publishing Twitter data]
