Social data reporting: AMA with Lam Thuy Vo

Conversations with Data: #11



What do you remember most about 2012? Some of you might think back to the London Olympics, or perhaps Gangnam Style… but we’ll always remember it as the year that the European Journalism Centre launched the Data Journalism Handbook.

Flash forward six years and we’re producing a second edition for beta release. But we’re a wee bit too excited to wait until then. In anticipation, we’ve already published an exclusive preview chapter on algorithmic accountability.

And now, this edition of Conversations with Data brings you an 'ask me anything' with the author of the Handbook’s social media chapter, BuzzFeed’s Lam Thuy Vo. She answers your questions on getting started with user-generated content, privacy, combating fake news, and more.


What you asked

What makes social reporting different from other types of data journalism?

Lam: "There’s a tendency among some journalists to examine the social web as a one-to-one representation of society itself. It’s a natural inclination: the way we see information come in on newsfeeds, timelines and comment strings makes it seem continuous, like a representation of the people who compose our actual immediate environment.

But social media data is odd: while there’s a rigidity to its format that’s akin to that of other data sets (date times, content categorisation like text, video or photos, etc.), it comes with all kinds of irregularities related to who posted the content."

Can you give us an example?

"Take the data from a Facebook group, for instance. Even if a group has thousands of followers, only a fraction of them may actively react to or comment on content, let alone post. The posts may come in spurts or fairly frequently, without much regularity to them -- and this data may change at any moment as more people comment on or react to the posts. All this makes for highly unwieldy data that needs to be interpreted with care and caution."
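To make the point concrete, here is a minimal Python sketch of two checks one might run before drawing conclusions from group data: what share of members actually show up in the activity, and how uneven the gaps between posts are. The records, field names (`member`, `kind`, `time`) and the group size are invented for illustration; they are not a real Facebook export format and not Lam's own method.

```python
from datetime import datetime

# Hypothetical sample: one record per post, comment or reaction in a group.
activity = [
    {"member": "a", "kind": "post", "time": "2018-09-01T09:00"},
    {"member": "b", "kind": "comment", "time": "2018-09-01T09:30"},
    {"member": "a", "kind": "post", "time": "2018-09-01T10:00"},
    {"member": "c", "kind": "reaction", "time": "2018-09-01T10:05"},
    {"member": "a", "kind": "post", "time": "2018-09-08T10:00"},
]
GROUP_SIZE = 1000  # assumed total number of group members


def participation_rate(records, group_size):
    """Share of members who posted, commented or reacted at least once."""
    active_members = {record["member"] for record in records}
    return len(active_members) / group_size


def post_gaps_hours(records):
    """Hours between consecutive posts, in chronological order."""
    times = sorted(
        datetime.fromisoformat(record["time"])
        for record in records
        if record["kind"] == "post"
    )
    return [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(times, times[1:])
    ]


rate = participation_rate(activity, GROUP_SIZE)
gaps = post_gaps_hours(activity)
print(f"Active members: {rate:.1%}")
print(f"Hours between posts: {gaps}")
```

In this made-up sample only three of 1,000 members are active and the gaps between posts range from one hour to a week, which is the kind of skew that should temper any claim about what "the group" thinks.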

In addition to this risk of misinterpretation, social data is inherently personal, bringing with it privacy and ethical challenges. How can journalists address these?

"First, there’s an approach I often refer to as a quantified selfie -- an examination of a person’s data with their permission and with their help in interpreting it. Since social media data is by nature highly subjective, stories about an individual are likely best done with that person’s help in interpreting the data."


From Lam’s quantified selfie project, which used most played songs to showcase the emotional impact of moving to New York.

"When working with content created by everyday people, journalists should also be very conscientious about people’s privacy and what the amplification of social media posts could mean to them. BuzzFeed News’ ethics guide contains very helpful language on the subject: 'We should be attentive to the intended audience for a social media post, and whether vastly increasing that audience reveals an important story -- or just shames or embarrasses a random person'."

What are some other challenges?

"What’s particularly difficult is that social data can be very ephemeral. For one story, for instance, Craig Silverman, Jane Lytvynenko, Jeremy Singer-Vine and I looked into how popular hyperpartisan news organisations were as compared to their more mainstream counterparts. The hardest part was that during the reporting, data parsing (more than four million Facebook posts) and analysis, Facebook pages included in the analysis kept being deleted. Not only did we have to deal with large amounts of data that would strain my computer, this data set was also a constantly moving target."
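One simple way to keep track of a moving target like this is to save dated snapshots of the pages under analysis and compare them. The Python sketch below illustrates that idea only; the page IDs and post counts are invented, and this is not a description of how the BuzzFeed News team handled their data.

```python
# Hypothetical snapshots: page ID -> number of posts collected on that date.
snapshot_week_1 = {"page_a": 1200, "page_b": 860, "page_c": 430}
snapshot_week_2 = {"page_a": 1250, "page_c": 445, "page_d": 90}


def diff_snapshots(earlier, later):
    """Report which page IDs disappeared or appeared between two snapshots."""
    earlier_ids, later_ids = set(earlier), set(later)
    return {
        "removed": sorted(earlier_ids - later_ids),
        "added": sorted(later_ids - earlier_ids),
    }


changes = diff_snapshots(snapshot_week_1, snapshot_week_2)
print("Pages no longer available:", changes["removed"])
print("Pages newly collected:", changes["added"])
```

Recording when each snapshot was taken also makes it possible to state in the story which date the figures refer to.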

Following on from that example, how do you think journalists can use social media data to combat fake news?

"One very important part of combating the spread of misinformation and false, made-up stories is to report on the subject and find ways to hold companies accountable while also educating news consumers."

"...there’s a huge need for people to learn about the fallacies of how we see the world through social media. Explanatory reporting in that realm can be hugely impactful. From the distortion of information through filter bubbles to issues surrounding troll attacks, through examples of other individuals we can hopefully raise a level of scepticism in people to prevent them from sharing information in reactionary and emotional ways, and to encourage a more critical reading of the information they encounter online."


A screenshot from Lam’s trolling visualisation, illustrating to audiences what a Twitter attack feels like.

Finally, do you have any advice for journalists who are new to user-generated content?

"While each story is unique, I’d say there is one good guideline I’ve picked up: be specific in your stories -- people can get lost in data stories that are too large in scope. For example, it’s more definitive, and hard enough already, to do a story about one particular group of politicians who are Facebook users and are spreading hate speech than to try to prove the spread of hate speech among an entire part of society."

Our next conversation

While social media data can usually be obtained via an API, there are often times when you’ll have to scrape it yourself. For our next conversation, we want to feature your tips for these situations.

Until next time,

Madolyn from the EJC Data team
