Welcome to the battle of the titans. This article explores the differences between the two most popular programming languages for data journalists: R and Python. We also look at their similarities and at whether it is possible to master both.
The point of this article is not to make a case for either language. Rather, we hope to provide a useful comparison between two very popular tools for data journalism. If you have been wondering whether you should acquire programming skills for data journalism, this article will explain what's involved in learning how to code. We also provide guidance for those who have already programmed in one language and are wondering what the other language has to offer.
Is coding an essential skill for a data journalist?
Let's start with the question of why: Why should a data journalist know how to code? While there are several benefits, the biggest one is that programming skills can expand the types of projects you can do.
You can be the type of data journalist who downloads a finished dataset and does all the data wrangling in Excel, OpenRefine, or a tool designed specifically for your newsroom. You can create graphics using Datawrapper, Tableau, or yet another in-house tool for data visualisation. The software that covers the various aspects of a data journalist's workflow already exists, and it can save you the frustrating journey of hunting for coding errors and reading countless Stack Overflow threads. Still, coding can be extremely powerful and give you a form of control and freedom unmatched by off-the-shelf software.
What does coding offer that data analysis software can't?
The steps required to create a data journalism piece can vary from project to project, but we can generalise the data journalist's workflow. The process usually consists of four moments: compile, clean, visualise, publish. When is coding useful in each phase?
Whenever the data for your story is not immediately available, you need to compile it. Coding can make this much easier, for example by extracting data from the web or obtaining it through an API. Whether you are building a crawler to assemble data from multiple web pages or collecting data from Twitter or the World Bank through their APIs, coding can help you become more resourceful and get around some typical obstacles to accessing data: you can source data that is not easily accessible, collect large amounts of it, and reduce the time it takes to compile it.
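To make the idea of extracting data from a web page concrete, here is a minimal sketch that turns an HTML table into a list of records using only Python's standard library. The page content, the class name `TableParser`, and the function name `scrape_table` are all made up for illustration; in a real project the HTML would be downloaded first, and many journalists would reach for a dedicated scraping library instead.

```python
from html.parser import HTMLParser

# Made-up page content. In practice this string would be downloaded,
# for example with urllib.request.urlopen(url).read().decode("utf-8").
SAMPLE_HTML = """
<table>
  <tr><th>Region</th><th>Schools</th></tr>
  <tr><td>Region A</td><td>120</td></tr>
  <tr><td>Region B</td><td>85</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows
        self._row = None       # row currently being read
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_table(html):
    """Return one dict per data row, keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = scrape_table(SAMPLE_HTML)
for record in records:
    print(record)
```

Once a page is reduced to records like these, the same function can be run over hundreds of pages in a loop, which is where the time savings come from.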
When the volume of data is immense, coding makes data cleaning faster, or makes it possible in the first place. Spreadsheet software limits the size of datasets (as of 2021, Excel has a limit of 1,048,576 rows by 16,384 columns), so working with very large datasets often requires programming. As Basile Simon writes in the Data Journalism Handbook 2, there is a point "where data and code become companions."
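One way code sidesteps row limits is by reading a file one row at a time instead of loading it all at once. The sketch below assumes a hypothetical CSV with `city`, `year`, and `incidents` columns (the values are invented) and cleans each row as it streams past; the function name `total_by_city` is ours, not from any library.

```python
import csv
import io
from collections import Counter

# Stand-in for a file too large for a spreadsheet. In practice you would
# pass open("big_file.csv", newline="") instead of this in-memory sample.
SAMPLE_CSV = io.StringIO(
    "city,year,incidents\n"
    "Springfield,2020,12\n"
    " springfield ,2021,7\n"
    "Shelbyville,2021,\n"
    "Shelbyville,2020,5\n"
)


def total_by_city(handle):
    """Stream rows one at a time, clean them, and sum incidents per city."""
    totals = Counter()
    skipped = 0
    for row in csv.DictReader(handle):
        city = row["city"].strip().title()  # normalise stray spaces and casing
        value = row["incidents"].strip()
        if not value.isdigit():             # missing or malformed count
            skipped += 1
            continue
        totals[city] += int(value)
    return dict(totals), skipped


totals, skipped = total_by_city(SAMPLE_CSV)
print(totals, skipped)
```

Because only one row is held in memory at a time, the same loop works whether the file has five rows or fifty million. Reporting how many rows were skipped is a deliberate choice: it keeps the cleaning step transparent.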
Visualisation is the stage where no-code journalists are likely to suffer the least: the list of powerful visualisation software is extensive and growing. Creating charts that explain the story in your data can be done with Datawrapper, Tableau, Flourish, and other programmes. Yet many news organisations' data teams increasingly rely on programming to create their graphs. Both The Economist and The Times of London use R along with React; The New York Times relies on D3.js; The Washington Post uses a mix of languages and software, as is likely the case in many newsrooms right now.
Once you have collected, cleaned and analysed your data in a programming environment, creating the visualisations there too can streamline the workflow. In addition, both R and Python have capable data visualisation libraries that let you create appropriate charts, customise them, and define a house style for the newsroom.
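As a rough illustration of what defining a house style in code can mean, the sketch below keeps the styling decisions in one dictionary and reuses them for every chart. It writes a simple SVG bar chart by hand so that it runs with the standard library alone; in practice a plotting library would do the drawing. The colours, font, and the names `HOUSE_STYLE` and `bar_chart_svg` are all hypothetical.

```python
from xml.sax.saxutils import escape

# Hypothetical house style; a real newsroom would define its own values.
HOUSE_STYLE = {
    "bar_colour": "#336699",
    "text_colour": "#222222",
    "font": "Helvetica, Arial, sans-serif",
    "bar_height": 24,
    "gap": 8,
}


def bar_chart_svg(title, data, style=HOUSE_STYLE, width=400, label_width=100):
    """Return an SVG horizontal bar chart for a {label: value} mapping."""
    top = 40  # vertical space reserved for the title
    step = style["bar_height"] + style["gap"]
    height = top + step * len(data)
    largest = max(data.values())
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" '
        f'height="{height}" font-family="{style["font"]}" '
        f'fill="{style["text_colour"]}">',
        f'<text x="0" y="20" font-size="16" font-weight="bold">'
        f'{escape(title)}</text>',
    ]
    for i, (label, value) in enumerate(data.items()):
        y = top + i * step
        # Scale each bar relative to the largest value in the data.
        bar = (width - label_width) * value / largest
        parts.append(
            f'<text x="0" y="{y + 17}" font-size="12">{escape(label)}</text>'
        )
        parts.append(
            f'<rect x="{label_width}" y="{y}" width="{bar:.1f}" '
            f'height="{style["bar_height"]}" fill="{style["bar_colour"]}"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Made-up numbers, for illustration only.
svg = bar_chart_svg("Made-up example", {"Group A": 10, "Group B": 5})
print(svg)
```

Changing a colour or font in `HOUSE_STYLE` restyles every chart produced by the function, which is the main appeal of keeping style in code rather than adjusting each graphic by hand.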
What are R and Python anyway?
R is a programming language that emerged in the 1990s out of a desire to develop a powerful method for statistical modelling and data analysis. The language has since been used primarily in academia and by statisticians, although its user base is now expanding to other fields. Because of its specific purpose, it is a language with its own syntax. This means that it is difficult to learn for both those with and without prior programming experience.
Most people write R in an integrated development environment (IDE) known as RStudio, which, simply put, makes coding more comfortable by showing you at a glance your history, plots, loaded packages, help pages, and more.
While R has the disadvantage of an idiosyncratic syntax, it is known for its ease of use thanks to over 18,000 packages that facilitate coding. Packages are compilations of functions that, when loaded and called, produce powerful operations without requiring you to specify the operation itself. And as James Fransham of The Economist recently noted, there are packages for almost everything in R!
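The passage above is about R, but packages work the same way in Python, so here is a small Python sketch of the difference between spelling out an operation yourself and calling a function that someone has already written. The numbers are invented and `median_by_hand` is a hypothetical helper; `statistics.median` is part of Python's standard library.

```python
import statistics

# Made-up values, for illustration only.
response_times = [4, 8, 15, 16, 23, 42]


def median_by_hand(values):
    """Without a library: spell out the operation yourself."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2:  # odd number of values: take the middle one
        return ordered[middle]
    # Even number of values: average the two middle ones.
    return (ordered[middle - 1] + ordered[middle]) / 2


# With a library: one call, and the operation is already written and tested.
by_hand = median_by_hand(response_times)
from_library = statistics.median(response_times)
print(by_hand, from_library)
```

Both lines give the same answer; the library version simply spares you from writing and debugging the logic.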
Many R users code in a style known as the Tidyverse, which consists of a number of packages that make data wrangling, analysis, and visualisation very neat. One of these packages is ggplot2, probably the most popular data visualisation library available.
R was born out of the need to work with data in all its aspects, and it allows its users to do just that in a reliable way, with a large and active community of users specialised in statistics and data science.
R applications in data journalism
The Economist's Off the Charts team recently entertained us by making the case for R and Python from the perspective of two data journalists who each use one of the two languages. In the end, it turned out that the team as a whole leans toward R.
The Pudding has also used R in some of its projects. The authors of This is Karen used data to find the names that correlate most strongly with Karen, and to whimsically warn anyone at risk of becoming the next "Karen."
The BBC Visual and Data Journalism team has developed an R package that allows journalists to generate charts in the BBC's house style. They even wrote an R Cookbook with code and information about the different plots that can be created with the package.
Python, unlike R, is a general-purpose programming language, which puts it closer to other languages like Java, C++, and C.
Like R, Python was developed in the early 1990s. Its appeal is understandable when you consider why it was developed: to make code more readable and shorter. Since it is a general-purpose programming language, it has a wider range of applications than R. It can be used for both data analysis and application development. Due to its broader applications, it has a larger user base worldwide and is therefore very popular and sought after. In particular, it is the preferred language for machine learning and AI, which could make it very interesting for data journalists in the future.
Python has more IDEs to choose from than R, along with many well-established libraries, so it may be difficult to know where to start. Popular IDEs include Spyder and PyCharm, but there are many more. In terms of packages, Python has a lot to offer as well.
Python applications in data journalism
The International Consortium of Investigative Journalists (ICIJ) has revealed that it uses Python for much of the data processing in its recent groundbreaking investigation, The Pandora Papers. In the US, data journalist Melissa Lewis of Reveal used Python to analyse and visualise data for a study on the length of time immigrant children are detained in the United States.
Can I switch from one to the other?
If you are proficient in one language, switching to the other should generally not cause you too much difficulty. The two languages are not worlds apart, but it is still advisable to start with one, get used to it, and only then pick up the other. Moving from R to Python is definitely worth it if you want to learn more general programming, focus on machine learning, and join a broader user base. Likewise, moving from Python to R is worth it if you want to work with a large number of task-specific packages, with a particular focus on statistical analysis and visualisation.
Finally, it is probably not necessary to be proficient in both languages, but it would likely make you an attractive candidate and expand the number of organisations you can easily collaborate with, since you would share their skills and their programming language of choice.
To wrap up
As a data journalist, you can survive without programming. There is software that can take you through all stages of creating a data journalism piece, from analysis to visualisation. Programming, however, opens up a world of possibilities and freedom for you. It is more powerful, transferable, reproducible and repeatable. And in the future, it will likely become even more in demand by newsrooms.
R and Python are the two most popular languages for data journalism. They are fundamentally different, but allow you to achieve similar results. If you are undecided about which language to invest your time in, here is what we advise you to think about:
- What do your colleagues use?
- Are you only interested in data analysis or do you want to develop applications as well?
- Do you want to learn a general programming language, or is statistics your main focus?
- What do your favourite newsrooms use?
The chart below summarises the pros and cons of each language.
Finally, with time and expertise, it is possible to master both languages. If you integrate both, you can benefit from the advantages of both languages, especially considering that each has some strengths depending on the task.
Below are some resources for getting started with Python and R. If you know of any others, we would love to hear about them! Join us on Discord and let us know!
- Paul Bradshaw’s GitHub
- Intro to R by MaryJo Webster
- How do I? ...do that in R by Sharon Machlis
- Practical R for Mass Communication and Journalism by Sharon Machlis
- R for Data Science by Hadley Wickham and Garrett Grolemund