The only path to becoming a successful data journalist is to commit oneself to a lifestyle of continuous self-learning. This is simply the price of admission to specialise in this ever-evolving field of journalism.
At every turn, we are confronted with steep learning curves that require us to decide where we will invest our limited time and mental energy. These decisions are often made on the basis of the journalist’s perceived return on investment:
- Python or R (and then which libraries?)
- PostgreSQL, MySQL, or SQLite (do I even need databases?)
While grappling with those tough questions will inevitably remain a rite of passage, let me propose at least one learning trajectory with guaranteed journalistic returns: proficiency with geographic information systems (GIS).
What is GIS?
Speak with a few GIS professionals and a common theme will emerge: they struggle to explain to their loved ones exactly what it is they do.
Many of us understand superficially that GIS has something to do with ‘mapping’ and ‘geography’, but this is just the tip of the iceberg.
Similar to how tools such as spreadsheets or databases are used to manipulate, summarise, query, edit, and visualise information, GIS allows the same operations to take place -- but with the addition of a spatial dimension, connecting your data to a location in space.
For example, if you had a database of all homes built in your community, it might contain details about each house’s features, such as the year it was built, the number of floors, the total living space, the value of the property, when a building permit was last issued, and much more. With this information you could derive all kinds of interesting insights about the makeup of homes in your community.
By adding geocoded home addresses to this database, you would now have the ability to evaluate these homes based on their physical location relative to one another, on their density in certain areas, as well as their proximity to certain landmarks, such as a landfill or a train station. This is GIS in its simplest form.
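As a rough sketch of the idea above, the snippet below adds coordinates to a few home records and measures each home's distance to a landmark. The addresses, coordinates, and the 4 km threshold are all made up for illustration, and the haversine formula treats the Earth as a perfect sphere, so the distances are approximate.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; a spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Hypothetical records: ordinary attributes plus geocoded coordinates
homes = [
    {"address": "12 Elm St", "year_built": 1962, "lat": 45.4215, "lon": -75.6972},
    {"address": "98 Oak Ave", "year_built": 2004, "lat": 45.4290, "lon": -75.6890},
    {"address": "7 Pine Rd", "year_built": 1987, "lat": 45.4001, "lon": -75.7300},
]
station = (45.4165, -75.6519)  # hypothetical train station (lat, lon)

# The spatial dimension becomes one more attribute you can query
for home in homes:
    home["dist_to_station_m"] = haversine_m(home["lat"], home["lon"], *station)

# Homes within 4 km of the station, nearest first
nearby = sorted(
    (h for h in homes if h["dist_to_station_m"] <= 4_000),
    key=lambda h: h["dist_to_station_m"],
)
for h in nearby:
    print(f'{h["address"]}: {h["dist_to_station_m"]:.0f} m')
```

A desktop GIS does the same kind of work behind its menus, with far better handling of the Earth's true shape.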
GIS touches every aspect of our lives
GIS technology and concepts are all around us and have real-world consequences. The following are just a few examples that are of great public interest:
- emergency services dispatching
- forestry management
- traffic and public transportation management
- flood forecasting and climatology
- housing development
- epidemiology and public health
- online food order and ridesharing services
- mail and parcel delivery services.
Any journalist hoping to closely scrutinise policy decisions emanating from these areas would be well served by learning the same tools and concepts that drive many of those very decisions.
This is GIS-driven journalism in response to the rise of GIS in society.
This is no different than a traditional political reporter learning basic accounting principles in order to make sense of government budgets and annual reports.
The good news is: Many data journalists have already embraced the use of spatial data and mapping in their storytelling. A 2017 study by researchers at the University of Hamburg found that maps were used by half of the 225 projects nominated for the Global Editors Network’s Data Journalism Awards between 2013 and 2016.
While collectively we are making use of maps as a powerful visualisation tool, my observation has been that many data journalists are missing out on some key opportunities to uncover additional insight within their data, especially spatial data.
Cartography vs. GIS
A significant number of maps used in media publications would more appropriately be classified as a form of cartography -- that is, mapmaking for the purpose of providing some form of geographic context through graphic visualisation.
Most fledgling data journalists have at some point in their career succumbed to the irresistible urge to interactively plot any spatial data they could get their hands on, often using the soon-to-be-fully-deprecated Google Fusion Tables.
This was especially true at a time before the open data ethos took root in many public institutions, and spatial data was often closely guarded by internal gatekeepers. The novelty of having that map file in hand after months of freedom of information requests felt like justification enough to publish a map.
This author is guilty as charged.
The following map illustrates (albeit in an extreme way) the limited usefulness of simply representing information on a map:
At the core of the issue is the distinction between cartography, which is largely about representing data graphically, and GIS, which seeks to analyse the spatial relationship between elements on a map.
Modern day roadmap: a Victorian era case-study
Widely considered one of the first examples of modern epidemiology, English physician John Snow’s geographic tracking of an 1854 cholera outbreak in London is a textbook example of insight through GIS.
In light of a mounting death toll in a specific neighbourhood in London’s West End, John Snow embarked on a study that involved mapping the home residences of all persons who died from cholera infections. His review of the resulting data showed a tight clustering of fatalities around a single water source known infamously as the ‘Broad Street Pump’. It would later be discovered that the water source leading up to this public street-level supply was contaminated by raw sewage.
As put by former news and data editor for the Guardian, Simon Rogers, John Snow’s study and reasoning gave data journalists a 'working model' for how to approach their craft.
Consider this: If John Snow were alive today working as a public health researcher, the same analysis would have been done using a computer-based GIS application.
He likely would have also had unfettered access to municipal spatial files for the entire underground water and sewer line network, along with their maintenance records, exact pump locations, water consumption data, water quality testing results throughout the network, population density for each neighbourhood, and, finally, coroner or medical examiner reports on the cholera-related fatalities.
As it turns out, today’s data journalists could probably access most of those records as well.
Think of the possibilities.
Questions waiting to be answered by you
So, what are some concrete examples of how GIS can enhance your journalism?
- It can help calculate crowd sizes, as described by this Reuters article on the Hong Kong protests.
- It can help identify Airbnb listings that violate municipal zoning bylaws in your community.
- You could quantify the extent to which the presence of Uber in your city is affecting existing transportation services.
- Or, you could even debunk a commonly held belief that the wealthy get better municipal services, such as snow clearing after a blizzard.
These more advanced examples are just a few among many great ones, but the key is that they all look for patterns, outliers, and the connections between data.
Getting started with GIS: key concepts
1. GIS software
At some point, early in your exploration of spatial data and mapping, you’ll run into a situation where you’ll need to either convert a file type, modify the projection (more on this later), add attribute data, or make edits to a boundary.
For many journalists this represents the initial foray into QGIS, which is a free and open-source GUI desktop application. QGIS supports everything that a beginner would need; it also satisfies most of the needs of advanced users.
The other tool often used in newsrooms (and nearly always used by GIS professionals world-wide) is ArcGIS, a commercial software package. ArcGIS has some functionality that QGIS lacks, but because of the nature of the open-source community, plugins for enhanced features in QGIS are often available to help narrow that utility gap.
A good starting point is to download QGIS and follow along with its ‘A Gentle Introduction to GIS’ tutorial.
Depending on your level of knowledge and programming skills, you can also explore how to perform GIS analyses through code, using spatial packages for Python or R. I would first recommend getting familiar with GIS concepts using a desktop application, however.
You may reach a point where you’ll find QGIS, Python, or R cannot efficiently process a high enough volume of data. In these situations, many analysts opt for a more powerful tool such as PostGIS, the popular spatial extension for the PostgreSQL database, which essentially stores your spatial data inside of a database and allows the user to query these records using a series of SQL-esque functions. But this falls far beyond the scope of this Long Read.
2. Spatial file types
Because most of us initially learned to visualise maps using Google-based applications, Keyhole Markup Language (KML) files were our first exposure to spatial data. This is the default filetype for Google Earth, Fusion Tables, and other Google mapping tools.
KML files are text-based and resemble XML or HTML structures. You may also encounter KMZ files which are simply compressed KML files that have been zipped to reduce storage size.
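Because KML is plain XML and a KMZ is just a zip archive, both can be opened with nothing more than Python's standard library. The sketch below is a minimal illustration: the sample document and place names are invented, the function names are my own, and only point placemarks are handled. Note that KML stores coordinates in longitude, latitude order.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# A tiny hand-written KML document (hypothetical places)
SAMPLE_KML = b"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>City Hall</name>
      <Point><coordinates>-75.6972,45.4215,0</coordinates></Point>
    </Placemark>
    <Placemark>
      <name>Train station</name>
      <Point><coordinates>-75.6519,45.4165,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""


def read_placemarks(kml_bytes):
    """Return (name, longitude, latitude) for every point Placemark."""
    root = ET.fromstring(kml_bytes)
    points = []
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS)
        coords = placemark.findtext("kml:Point/kml:coordinates", namespaces=KML_NS)
        if coords is None:
            continue  # lines and polygons are skipped in this sketch
        lon, lat = (float(v) for v in coords.strip().split(",")[:2])
        points.append((name, lon, lat))
    return points


def read_kmz(kmz_bytes):
    """A KMZ is a zip archive; read the first .kml file found inside it."""
    with zipfile.ZipFile(io.BytesIO(kmz_bytes)) as archive:
        kml_name = next(n for n in archive.namelist() if n.lower().endswith(".kml"))
        return read_placemarks(archive.read(kml_name))


# Build a KMZ in memory to show that it holds the same content
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("doc.kml", SAMPLE_KML)

placemarks = read_placemarks(SAMPLE_KML)
print(placemarks)
print(read_kmz(buffer.getvalue()) == placemarks)
```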
As you progress in your GIS learning, the next file type you will likely encounter is a Shapefile.
The ubiquitous Shapefile is actually a collection of files that are nearly always distributed in a single zipped folder. The key thing to know about this file type is that each file in the bundle -- some of which are mandatory, others optional -- serves a unique purpose.
The file with the extension ‘shp’ contains the information that draws the points, lines, or polygons on the maps. The ‘shx’ file contains indexing information which helps speed up processing times. The ‘dbf’ file contains all of the attributes about each element. These three files are required, otherwise your Shapefile will not function properly.
Another common file contained in this bundle (but not required) is the ‘prj’ file, which specifies the projection to be used when the file is loaded (more on this in the next section).
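Before loading a zipped Shapefile you have been sent, it can save time to confirm that the bundle is complete. The following is a minimal sketch using only the standard library; the function names and the `parcels` filenames are hypothetical, and the check looks only at file extensions (it does not verify that the files share a base name or contain valid data).

```python
import io
import zipfile
from pathlib import PurePosixPath

REQUIRED = {".shp", ".shx", ".dbf"}  # the three mandatory components


def check_shapefile_bundle(zip_bytes):
    """Report which required Shapefile parts are missing from a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        found = {PurePosixPath(name).suffix.lower() for name in archive.namelist()}
    return {
        "missing_required": sorted(REQUIRED - found),
        "has_prj": ".prj" in found,  # optional, but it records the projection
    }


def make_bundle(filenames):
    """Build a zip of empty placeholder files, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in filenames:
            archive.writestr(name, b"")
    return buffer.getvalue()


complete = check_shapefile_bundle(
    make_bundle(["parcels.shp", "parcels.shx", "parcels.dbf", "parcels.prj"])
)
broken = check_shapefile_bundle(make_bundle(["parcels.shp", "parcels.dbf"]))

print(complete)
print(broken)
```

A missing `prj` file will not stop the layer from loading, but you will then have to find out the projection from the data provider.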
One thing to note with spatial data is that it comes in two distinct flavours: vector and raster.
In most journalistic applications, vector files are used. However, it’s important to be aware that many other industries make use of raster files. Raster data often comes in the form of satellite imagery or aerial photographs, where the value given to each cell or pixel is the data itself (for example, a specific shade of green for a certain pixel in a satellite image could represent a type of vegetation). These types of data are frequently used in forestry and natural resources management.
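To make the distinction concrete, here is a toy sketch of how the two flavours might look as plain Python data. Everything in it is invented for illustration: the parcel, the land-cover codes, the grid origin, and the 30 metre cell size. Real rasters carry the same ingredients (a grid of values, an origin, and a cell size), just at a much larger scale.

```python
from collections import Counter

# Vector: explicit geometry (a closed ring of coordinates) plus attributes
parcel = {
    "geometry": [(0, 0), (40, 0), (40, 30), (0, 30), (0, 0)],
    "attributes": {"zoning": "residential", "year_built": 1987},
}

# Raster: a grid of cells where each value IS the data
LAND_COVER = {1: "forest", 2: "water", 3: "urban"}  # hypothetical codes
raster = [
    [1, 1, 2],
    [1, 3, 2],
    [1, 3, 3],
]
ORIGIN_X, ORIGIN_Y = 500_000.0, 5_000_000.0  # top-left corner, in metres
CELL_SIZE = 30.0                             # each cell covers 30 m x 30 m


def cell_centre(row, col):
    """Map a grid position to the real-world coordinates of the cell centre."""
    x = ORIGIN_X + (col + 0.5) * CELL_SIZE
    y = ORIGIN_Y - (row + 0.5) * CELL_SIZE  # rows count downwards from the top
    return x, y


# Summarise the raster: area covered by each land-cover class
cover_counts = Counter(LAND_COVER[value] for row in raster for value in row)
area_by_cover = {name: n * CELL_SIZE ** 2 for name, n in cover_counts.items()}

print(cell_centre(0, 0))
print(area_by_cover)
```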
3. Coordinate reference systems and projections
If you want to save yourself hours of frustration and troubleshooting, pay close attention to this section as it is foundational to GIS.
From my personal experience and from assisting other reporters over the years, a lack of clear understanding of how projections and coordinate reference systems work is the cause of nearly all errors for beginners and intermediate users alike.
So, what are projections?
The concept of projections comes from the fact that there is no perfect way to represent the surface of a sphere on a sheet of paper (or a computer monitor for that matter). To illustrate this point: take an orange and, after removing the peel, try to lay the skin flat on a table. See the problem?
Over the years, cartographers have come up with many different methods for overcoming some of these limitations, but none are perfect. These varying approaches for displaying the world on a flat surface are known as ‘projections’. They fall into several families, and there are close to 6,000 unique ones for applications of all types.
Coordinate reference systems provide the framework for defining real-world locations. They come in two types: geographic coordinate reference systems and projected coordinate reference systems.
When you began working with maps, say on a Google platform, it’s likely that you simply uploaded your KML file or geocoded a series of latitude and longitude coordinates, and then proceeded to visualise them on a web mapping service, never considering there was a very specific coordinate reference system being assigned by default.
What you probably didn’t realise at the time was that you were likely working with a geographic coordinate reference system known as WGS84. This is the standard for most GPS devices and many online mapping services. Sometimes this coordinate reference system is represented as EPSG:4326, which is simply a different coding system for projections. You will often hear people refer to WGS84 as a ‘projection’ and, while this is technically incorrect, it is often acceptable to refer to it as such when speaking in general terms.
A key thing to remember is that when you are working with coordinates in decimal degrees, the units of measurement are in degrees. Hold this thought for now.
With projected coordinate reference systems, rather than working with angles (decimal degrees) on a sphere, you are nearly always working with coordinates on a two-dimensional plane with an X axis (easting, the counterpart of longitude) and a Y axis (northing, the counterpart of latitude). The unit of measurement can be metres, kilometres, feet, miles, and so on.
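One way to see why degrees make a poor unit for measuring distance: the ground length of one degree of longitude depends on where you are. The sketch below uses a spherical Earth (an approximation; real coordinate reference systems use an ellipsoid), and the function names are my own.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; a spherical approximation


def metres_per_degree_latitude():
    """Roughly constant everywhere on a sphere."""
    return math.radians(1) * EARTH_RADIUS_M


def metres_per_degree_longitude(lat_deg):
    """Shrinks towards the poles as the meridians converge."""
    return math.radians(1) * EARTH_RADIUS_M * math.cos(math.radians(lat_deg))


for lat in (0, 30, 45, 60, 80):
    print(f"latitude {lat:>2}: one degree of longitude is about "
          f"{metres_per_degree_longitude(lat) / 1000:.1f} km")
```

On this model a degree of longitude spans about 111 km at the equator but only about half that at 60 degrees north, which is why a '0.01 degree buffer' has no fixed size on the ground.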
Depending on the purpose and especially the location of your work, it's important to select an appropriate projection and to recognise their limitations.
When you observe a traditional Mercator world map, the sizes of countries closer to the poles are exaggerated, while countries closer to the equator appear comparatively small.
There is no better way to illustrate this point than by using the online The True Size… tool, which allows a user to click and drag countries over the top of each other in order to compare their actual surface areas. This reinforces the fact that every map projection introduces some form of distortion.
As the authors of the website point out: “Greenland appears to be roughly the same size as Africa. In reality, Greenland is 0.8 million sq. miles and Africa is 11.6 million sq. miles, nearly 14 and a half times larger.”
I highly recommend watching this exceptional explainer video on map projections to better understand this concept and how it warps our sense of reality.
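The Mercator distortion discussed above can be put into numbers. On a spherical Mercator projection, lengths at a given latitude are stretched by 1/cos(latitude), so areas are stretched by the square of that. The sketch below applies this textbook relationship; the 72 degree figure is only my rough stand-in for central Greenland, not a measured value.

```python
import math


def mercator_area_exaggeration(lat_deg):
    """Factor by which spherical Mercator inflates areas at a given latitude."""
    return 1 / math.cos(math.radians(lat_deg)) ** 2


for label, lat in [("equator", 0), ("60 degrees north", 60), ("about 72 degrees north", 72)]:
    print(f"{label}: areas drawn about {mercator_area_exaggeration(lat):.1f}x too large")
```

At around 72 degrees north the factor is a little over 10, which goes a long way towards explaining how Greenland can look comparable to Africa on a Mercator map despite being roughly 14 times smaller.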
Why does all of this matter?
First, if you intend on measuring distances between cities, adding a buffer zone to a contaminated site, or calculating the surface area of an electoral district, the accuracy of your measurements could be jeopardised if you don’t select the appropriate coordinate reference system suited for your task. Be mindful of the scale and extent of your data. If your data spans only a city, your coordinate reference system should be different than for spatial data that spans an entire continent.
Secondly, remember how the coordinate reference system based on latitudes and longitudes uses degrees for its unit of measurement? It’s likely that you will want to be working with metres or kilometres for your project. In this case, you’ll need to convert your vector layers to a projection that uses your desired units.
Finally -- and this is extremely important -- if you intend to study the relationship between two spatial files, you must first make sure that both have matching projections. If you keep getting an error while trying new tools, this should always be the very first thing you verify. Best practice is to convert all of your spatial data to the same projection before you begin.
Read more about projections from the QGIS documentation.
A hypothetical walk-through: GIS in daily news
The following is a hypothetical newsworthy scenario paired with a walkthrough of potential GIS applications. Note that the suggested documentation all assumes the user is working with QGIS as their software of choice. This is not intended to be a step-by-step tutorial, but rather a high-level example of the mechanics of leveraging GIS tools for original news content.
1. Creating buffers
There has been a train derailment in your community and authorities say a chemical spill could have adverse health effects for residents within a 1,500 metre radius.
GIS tool to use:
After geocoding the address (or finding the latitude and longitude) corresponding with the location of the derailment, create a new vector layer in QGIS. As always, convert the layer to an appropriate projected coordinate reference system (one that uses metres as its base unit).
Then, use the geoprocessing tool called Buffer to enlarge the point on the map that represents the accident location to a circular boundary with a radius of 1,500 metres. This will result in the creation of a new layer containing a single polygon.
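For readers curious about what a buffer amounts to, the sketch below builds one by hand: a ring of vertices at a fixed distance from a point, in projected coordinates measured in metres. This is my own simplified illustration, not the routine QGIS uses, and the derailment coordinates and the `segments` parameter are hypothetical.

```python
import math


def buffer_point(x, y, radius_m, segments=64):
    """Approximate a circular buffer as a closed ring of (x, y) vertices."""
    ring = [
        (x + radius_m * math.cos(2 * math.pi * i / segments),
         y + radius_m * math.sin(2 * math.pi * i / segments))
        for i in range(segments)
    ]
    ring.append(ring[0])  # repeat the first vertex to close the polygon
    return ring


def polygon_area(ring):
    """Planar area of a closed ring, using the shoelace formula."""
    total = sum(x0 * y1 - x1 * y0 for (x0, y0), (x1, y1) in zip(ring, ring[1:]))
    return abs(total) / 2


# Hypothetical derailment site in a projected CRS (units: metres)
site_x, site_y = 445_000.0, 5_030_000.0
danger_zone = buffer_point(site_x, site_y, 1_500)

print(f"{len(danger_zone)} vertices, "
      f"area of about {polygon_area(danger_zone) / 1e6:.2f} square km")
```

The polygon is only an approximation of a circle; more segments bring its area closer to the true value at the cost of a heavier file.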
2. Generating centroids
Continuing with the previous scenario, in addition to visualising the danger zone on a map for your readers, you may also want to report on the number of homes that fall within this buffer zone. This is often a two-step process.
GIS tool to use:
From the previous step, you have a layer that contains a polygon representing the danger zone for the chemical spill. If you have access to it, get a copy of a municipal property parcel spatial file. As I’ve mentioned, the first step is to make sure the coordinate reference system of this new file matches the one employed by your buffer zone layer.
From here, you will want to filter out all properties that are not zoned as residential, because we’re ultimately interested in areas where people are most likely to live as opposed to parks or industrial areas. Once you have done so, the next step is to generate a new layer representing the centroids of all those residential parcels. A centroid is essentially an algorithm-generated point that represents the centre of a polygon.
To achieve this, use the vector geometry tool called Centroids. This process will output a new layer with a series of points.
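The two steps described above (filter, then compute centroids) can be sketched in a few lines. The parcels, their IDs, and the `zoning` attribute values are invented, and the centroid function uses the standard area-weighted formula for a simple polygon; I am not claiming this is exactly what QGIS runs internally.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as a closed ring."""
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross              # accumulates twice the signed area
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Raises ZeroDivisionError for degenerate (zero-area) rings
    return cx / (3 * area2), cy / (3 * area2)


# Hypothetical parcels in projected coordinates (metres)
parcels = [
    {"id": "A1", "zoning": "residential",
     "ring": [(0, 0), (40, 0), (40, 30), (0, 30), (0, 0)]},
    {"id": "B7", "zoning": "industrial",
     "ring": [(100, 0), (300, 0), (300, 150), (100, 150), (100, 0)]},
    {"id": "C3", "zoning": "residential",  # an L-shaped lot
     "ring": [(0, 50), (20, 50), (20, 60), (10, 60), (10, 70), (0, 70), (0, 50)]},
]

# Step 1: keep residential parcels only. Step 2: one centroid per parcel.
centroids = {
    p["id"]: polygon_centroid(p["ring"])
    for p in parcels
    if p["zoning"] == "residential"
}
print(centroids)
```

One caveat worth knowing: for strongly concave or C-shaped parcels, the centroid can land outside the parcel boundary.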
3. Points in polygon analysis
Once you have your centroids layer (double check you have a matching coordinate reference system) you want to perform an analysis to get the total number of points (that is, residential properties) that fall within your buffer layer.
GIS tool to use:
While QGIS has a convenient one-click function intuitively called Count points in polygon, under the hood, this tool is actually testing the spatial relationship between each centroid derived from your property data, against the polygon you created using the buffer tool. Using an intersects operation, the function returns TRUE or FALSE for each centroid, ultimately providing you with a total count of all the TRUEs.
To get this final result, select the Count points in polygon analysis tool. After running this process, a new polygon layer will be generated with the exact same content as the buffer zone polygon layer, but will now contain an additional attribute field with the point count. This value is the number of homes in the danger zone that you will report in your story.
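The TRUE/FALSE test described above can be illustrated with a classic ray-casting check: draw a ray from the point and count how many polygon edges it crosses. This is a generic sketch of the concept, not QGIS's implementation; the danger zone is centred on (0, 0) for simplicity and the centroid coordinates are invented.

```python
import math


def point_in_polygon(x, y, ring):
    """Ray casting: an odd number of edge crossings means the point is inside."""
    inside = False
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        if (y0 > y) != (y1 > y):  # the edge straddles the horizontal ray
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


# A 1,500 m buffer approximated by 64 vertices (closed ring, metres)
RADIUS_M, SEGMENTS = 1_500, 64
danger_zone = [
    (RADIUS_M * math.cos(2 * math.pi * i / SEGMENTS),
     RADIUS_M * math.sin(2 * math.pi * i / SEGMENTS))
    for i in range(SEGMENTS)
]
danger_zone.append(danger_zone[0])

# Hypothetical residential centroids, same projected CRS as the buffer
home_centroids = [
    (100, 200), (1000, 1000), (-1400, 300),   # expected inside
    (1200, 1200), (0, -1600), (2000, 50),     # expected outside
]

flags = [point_in_polygon(x, y, danger_zone) for x, y in home_centroids]
homes_in_zone = sum(flags)  # True counts as 1
print(flags, homes_in_zone)
```

Points lying exactly on the boundary are a known edge case for this method, so treat counts for borderline homes with care.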
From here, you can even take your story a step further and use a nearest-neighbour analysis to identify the addresses of the 100 homes closest to the chemical spill. With this information in hand, you can elevate the impact of your story by including human voices of those most affected by this disaster.
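In its simplest form, finding the homes closest to a single point is a matter of ranking them by distance. The sketch below assumes projected coordinates in metres, so straight-line distance is meaningful; the addresses, coordinates, and function name are hypothetical, and a real nearest-neighbour tool would usually rely on a spatial index for large datasets.

```python
import heapq
import math


def nearest_homes(homes, origin, n):
    """Return the n homes closest to origin, nearest first."""
    return heapq.nsmallest(
        n, homes, key=lambda h: math.dist((h["x"], h["y"]), origin)
    )


spill_site = (0.0, 0.0)  # hypothetical, projected coordinates in metres
homes = [
    {"address": "12 Elm St", "x": 100, "y": 200},
    {"address": "98 Oak Ave", "x": 1000, "y": 1000},
    {"address": "7 Pine Rd", "x": -1400, "y": 300},
    {"address": "3 Birch Ct", "x": 50, "y": -60},
]

closest = nearest_homes(homes, spill_site, n=2)
print([h["address"] for h in closest])
```

For the story itself, `n=100` would give you the door-knocking list described above.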
In terms of the overall potential that GIS brings to your newsroom, this use case is merely the tip of the iceberg. For example, a journalist with more advanced GIS skills can perform vehicle routing analyses to shed light on emergency response times across their community, or even help to identify hotspots or categories of crimes using a clustering algorithm.
There are endless resources online to advance your learning journey, including the official documentation for your tool of choice, YouTube tutorials, and forums or websites such as Medium, Stack Overflow, GitHub, Reddit, and many more. My experience has been that the GIS community at large is very welcoming of journalists seeking to immerse themselves into this field of study; reaching out to these professionals to serve as informal mentors can give you a confidence boost when you get stuck or if you are unsure your work is accurate.