Visualising data is like solving a jigsaw puzzle. To be successful, there are some things you should know in advance -- the scene to be revealed by the pieces is the story you'd like the data graphic to convey. You also need to understand what you have at your disposal -- the pieces of the jigsaw are the set of data in front of you. And then it’s your job to piece together the puzzle, or assemble the elements of your graphic, to tell your data’s story.
Visualising data is a choice. Instead of words, we elect pictures. As the adage goes: one picture is worth a thousand words. The perceived power in the visual medium derives from its efficiency and multidimensionality.
Consider the following summary of the state of the world's health and wealth, drawn from data assembled by the Gapminder project:
The last 50 years (1965-2015) have seen tremendous progress in both the health and wealth of nations. The gain in life expectancy has been nothing short of remarkable. In 1965, Iceland topped all nations, with the average citizen living to 74 years. By 2015, almost all of Europe, most of the Americas, the majority of Asia, and even a selection of African countries had reached or exceeded that level. In the last 50 years, much of the world has become richer, when measured using GDP per capita, PPP inflation-adjusted. In 1965, the majority of nations earned below US$5,000 per head; by 2015, they had lifted incomes above US$10,000. Switzerland, with an average income of US$32,000 in 1965, remained one of Europe's richest nations, although, by 2015, it had been overtaken by Ireland, Norway and Luxembourg. Many African nations also became wealthier, with the prominent exception of Libya.
Now, let's take a look at a visualisation of that data as a pair of scatter plots: the power of the visual medium is palpable.
The mind readily discerns the various talking points detailed above. In text, information arrives one nugget at a time, in a prescribed sequence. In pictures, our eyes wander, foraging for information along multiple dimensions at once. Cognition is guided by design elements such as reference lines, legends, data labels, and annotations. Of note, the richness of the visual medium allows complex relationships to surface, which, when expressed verbally, lead to long-winded, caveat-laden sentences.
The efficiency and multidimensionality of the visual medium arise from a set of conventions and rules, which regularise the communications between producers of data visualisation and its consumers. These conventions and rules are often unspoken: it's the visual equivalent of saying ‘it goes without saying’.
Imagine if that global health and wealth graphic were also supplied with an additional ‘How to Read this Chart’ box, as seen below:
Stop! You want to scream at me: most of those words aren't necessary. Your objection is sustained. Including the ‘How to Read It’ box belittles the advantages of the visual medium. Lengthy instructions are obviated when designers follow certain conventions and rules that are intuitively grasped by readers. It goes without saying.
In the remainder of this Long Read, we’ll highlight a core set of conventions and rules that should guide our production of data graphics. The references listed at the end provide further explanation of these and other rules. Chapter 1 of information designer Alberto Cairo's book, How Charts Lie, is highly relevant, as he outlines how consumers should read charts from a producer's point of view. Alberto emphasises the possible existence of ‘mental models’ of data visualisation, such that visual communications succeed when the designer's model converges with the reader's.
Most conventions and rules in data visualisation are not unique -- in some cases, competing, contradictory conventions co-exist. Rules -- such as how we handle colours -- evolve over time as tools improve. Every convention has its exception: when our design deliberately turns against a rule, we call our readers' attention to the aberration, including, when appropriate, providing a ‘How to Read this Chart’ box.
In The Grammar of Graphics, Leland Wilkinson distinguishes aesthetics -- the encoding of data into geometric objects -- from guides, which assist understanding. Following this distinction, I organise the conventions and rules of data visualisation into two groups. In each section, two displays of the same data are juxtaposed, one conforming to and one diverging from the spotlighted convention or rule, to reveal the rationale behind these best practices.
Conventions on aesthetics
The pie chart endures despite being the chart form most maligned by data visualisation experts. Some pie charts are serviceable, provided that they follow appropriate conventions.
Let's look at an example lifted from a note by data visualisation developer Xan Gregg. These two pie charts display languages used on the internet:
Example one tells readers English is used on over half of the internet, while each of six other languages from Russian to Japanese accounts for about five percent.
In constructing this pie chart, I followed a number of conventions:
a) Use a reasonable number of slices, aggregating minor categories if necessary.
b) Order the slices by size from the largest to the smallest.
c) Place the ‘Other’ slice at the end of the sequence, regardless of the order scheme.
d) Position the first and largest slice against the upper vertical radius, and arrange the other slices in a clockwise fashion.
e) Vary colours only if the colours are encoding data.
In this case, I used a lighter shade for the ‘Other’ slice, signalling that it alone consists of multiple languages, and that it is the least important slice on the chart.
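The first three of these conventions are mechanical enough to sketch in code. The following is a minimal Python illustration, not the procedure used for the chart: the function name `pie_slices`, the `max_slices` cut-off, and the share figures are all hypothetical, invented for this example.

```python
# Hypothetical shares (percent); made-up figures, not the data behind the chart.
shares = {
    "Japanese": 5.0, "English": 54.0, "Spanish": 5.1, "Russian": 6.0,
    "German": 5.9, "French": 4.0, "Portuguese": 2.9, "Italian": 2.4,
    "Chinese": 1.7, "Polish": 1.7, "Persian": 1.5,
}


def pie_slices(shares, max_slices=8, other_label="Other"):
    """Return (label, value) pairs in conventional pie-chart order.

    Slices are sorted from largest to smallest; once the cap is reached,
    the remaining categories are summed into one 'Other' slice that is
    placed last, whatever its size.
    """
    ranked = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) <= max_slices:
        return ranked
    keep = ranked[: max_slices - 1]
    rest = ranked[max_slices - 1:]
    other_total = round(sum(value for _, value in rest), 1)
    return keep + [(other_label, other_total)]


slices = pie_slices(shares)
for label, value in slices:
    print(f"{label:<12}{value:>6.1f}")
```

Note that in this made-up data the ‘Other’ slice ends up larger than several named slices, yet it still goes last. Starting angle and direction (convention d) are usually plotting options; in matplotlib, for instance, `pie` accepts `startangle=90` and `counterclock=False`.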
These rules are unspoken. The designer invokes them silently, and the reader applies them intuitively. When such rules are overlooked, it takes more time to digest the pie chart. Take a look at the diverging example, in which the largest pie slice is placed at a random angle, other slices run in a random order, and each slice is assigned an arbitrary colour. When the chart maker diverges from conventions, the reader must devote time to figuring out the logic of the design.
The principal convention on a bar chart (and by extension, a column chart) is the start-at-zero rule, which stipulates that the lower limit of the value axis should be set to zero. Our next example, adapted from The Economist, is a specimen that does not follow this convention.
On this chart, the reader understands the retirement age in Switzerland to be twice that in France, since the Swiss bar is twice the width of the French bar. That last line can't be true, and it isn't true: the Swiss retirement age is only 10% above that of the French. When the value axis is extended to zero, as in our conforming example, the ratio of the bar widths is restored to the ratio of the data.
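The arithmetic behind this distortion can be sketched in a few lines of Python. The ages and the axis starting point below are hypothetical, chosen only so that the ratios match those described above (a 10% difference in the data, a twofold difference in bar widths); they are not the figures from the original chart.

```python
def bar_length(value, axis_start):
    """Length of a bar drawn from axis_start to value."""
    return value - axis_start


# Hypothetical retirement ages and axis start, for illustration only.
france, switzerland = 60.0, 66.0
truncated_start = 54.0

# Ratio in the data: the Swiss value is 10% above the French value.
data_ratio = switzerland / france

# Ratio of bar lengths when the axis starts at 54 instead of zero.
truncated_ratio = (
    bar_length(switzerland, truncated_start) / bar_length(france, truncated_start)
)

# Ratio of bar lengths when the axis starts at zero.
zero_ratio = bar_length(switzerland, 0.0) / bar_length(france, 0.0)

print(f"data ratio:           {data_ratio:.2f}")
print(f"truncated-axis ratio: {truncated_ratio:.2f}")
print(f"zero-based ratio:     {zero_ratio:.2f}")
```

Only a zero baseline makes the ratio of bar lengths equal to the ratio of the values for every pair of bars.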
The alert reader notices that the designer of example one has planted a break symbol on the left edge of each bar, signalling that its width is truncated (by more than half). Thus, the maker knowingly defies the aesthetic convention on bar charts. Acknowledgement does not fix the distortion introduced by the truncation, leading to probable misinterpretation.
Admittedly, the revamped bar chart is still short of adequate. A more effective display is achieved by switching to a dot plot as shown in example three. Another effective display option is to focus on the gaps in effective versus official retirement ages as shown in example four. Both of these designs work around the start-at-zero rule.
A scatter plot depicts each unit of data as a dot on a surface spanned by two axes. The horizontal (x) and vertical (y) positions of the dot encode two variables. The shape of the cloud of dots visualises the nature of the correlation between the two variables. Enjoy the splendid scatter plot in our next example, which singles out the United States as an outlier nation, in which outsized healthcare spending failed to produce the expected lift in life expectancy.
A convention governs which variable to place on which axis. In this example, per capita healthcare spending is purported to be a driver of health outcomes. By convention, healthcare spending (the explanatory variable) is encoded as x, and life expectancy (the outcome) as y.
In comparison, the x- and y-axes are swapped in our diverging example. Its visual form is the reflection of the same data across the 45-degree diagonal.
Because the design flouts the convention, many readers, especially those with training in STEM fields, will react with confusion, and even annoyance. While there is no clear design imperative for this rule, a strong scientific justification prevails.
A routine add-on to the scatter plot is the regression line (also misleadingly called a ‘trendline’ by the market-leading spreadsheet program Excel). Regression analysis quantifies the correlation between the two variables displayed by a scatter plot. The regression line is fitted so as to minimise the overall distance between the line and the cloud of dots (in the usual least-squares fit, the sum of squared distances). Our next scatter plot includes a regression line.
Most importantly, the distance between a given dot and the regression line is measured vertically -- not horizontally. This vertical separation is also the cue by which the reader learns the chart's key message: that Americans should have been enjoying life expectancy of over 82 years, given their level of spending, if additional spending translated to incremental years of life at the same rate as in other countries.
Swapping the x- and y-axes does not reflect the regression line (as it does the dots). For what minimises the vertical distances between dots and the line does not minimise the horizontal distances. As illustrated in example two below, where I reversed the axes of example one above, the regression line of x on y does not coincide with the reflected regression line of y on x.
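A small numerical sketch makes the point concrete. It assumes an ordinary least-squares fit and uses a toy data set invented for this illustration (not the healthcare data); the helper `ols_slope_intercept` is likewise a name made up here.

```python
def ols_slope_intercept(x, y):
    """Least-squares line for predicting y from x: returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

# Regression of y on x (minimises vertical distances).
slope_yx, intercept_yx = ols_slope_intercept(x, y)

# Regression of x on y (minimises horizontal distances on the original plot).
slope_xy, intercept_xy = ols_slope_intercept(y, x)

# Reflecting the y-on-x line across the diagonal rewrites y = a + b*x
# as x = (y - a) / b, so its slope on the swapped axes is 1 / b.
reflected_slope = 1 / slope_yx
reflected_intercept = -intercept_yx / slope_yx

print(f"x-on-y regression: x = {intercept_xy:.2f} + {slope_xy:.2f} * y")
print(f"reflected y-on-x:  x = {reflected_intercept:.2f} + {reflected_slope:.2f} * y")
```

The two slopes multiply to the squared correlation coefficient, so the two lines coincide only when the dots fall exactly on a straight line.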
Notably, this convention does not dictate which variable should be the explanatory variable, and which the outcome variable. After the designer decides these roles, the convention governs which variable is assigned to which axis. To wit, example two is appropriate if life expectancy is offered as an explanation for the variability in healthcare spending. See my blog post for more on this topic.
The natural place for a time variable, such as years, months, and dates, is on the horizontal axis. By convention, time runs left to right (substitute right to left in right-to-left (RTL) countries). The following pair of charts shows the rapid growth of Chinese tourists visiting Australia. Compare example one, in which time runs left to right, to example two, in which time runs bottom to top. The left-to-right convention is an unspoken rule shared between producers and consumers of data graphics in cultures that read left to right. Veering off this rule slows down cognition.
Another rule for time-series charts is proportional spacing. When data are collected at uneven intervals, the tick marks on the time axis should mimic the irregularity. Otherwise, the chart distorts the pace of growth. In another diverging example below, the growth trend appears to be linear, rather than ‘hockey stick’, an artefact of applying even spacing to unevenly-spaced data.
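To see the difference between the two spacings, here is a minimal Python sketch that computes where each observation would sit along a time axis of unit length. The years are hypothetical, picked only to be unevenly spaced; both function names are made up for this illustration.

```python
def proportional_positions(years):
    """Axis positions in [0, 1] that preserve the real gaps between years."""
    first, last = years[0], years[-1]
    return [(year - first) / (last - first) for year in years]


def even_positions(years):
    """Axis positions in [0, 1] that treat the years as evenly spaced labels."""
    n = len(years)
    return [i / (n - 1) for i in range(n)]


# Hypothetical, unevenly spaced observation years.
years = [1990, 2000, 2010, 2015, 2018]

proportional = proportional_positions(years)
even = even_positions(years)

for year, p, e in zip(years, proportional, even):
    print(f"{year}: proportional={p:.3f}  even={e:.3f}")
```

With even spacing, the three years between the last two points take up as much of the axis as the ten years between the first two, which flattens any late acceleration in the data. Plotting software typically gives proportional spacing when the years are passed as numbers or dates, and even spacing when they are passed as category labels.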
In the social media age, colour has become a favourite complement to any data graphic. Here are several main conventions guiding the application of colour:
a) Put a cap on the number of colours. As Dona Wong suggests in The Wall Street Journal Guide to Information Graphics, "admit colors gracefully, as you would receive in-laws into your home."
b) Same colour, same data; colour difference should reflect data difference. This rule disqualifies arbitrary assignment of colours.
c) Use certain colour pairs with care, as they are loaded with meaning. In the business community, black is positive and red is negative, but in some cultures, black is ominous while red is auspicious. For heatmaps, red is hot and blue is cold, while in US politics, the red-blue colour pair denotes the two major political parties. As I noted at the start, conventions sometimes clash.
d) Many authors recommend making charts friendly to colour-blind readers, for example, by inspecting a version in grayscale.
Let’s look at two variations of our bar chart that shows retirement ages in 13 countries. On the left, the bars are assigned meaningless colours, diverging from rule b. The reader gets confused by the false signal, searching fruitlessly for the data behind the colour scheme. On the right, the design uses a unique colour for every bar, defying rule a. Here, the colour palette becomes a distractor, diminishing one's speed of understanding.
The two plots of example two derive from the bar chart that plots the gaps in retirement ages. By the designer's choice, a positive gap means the effective retirement age exceeds the official retirement age. The chart on the left applies green to positive gaps, and red to negative gaps. Rule d advises against pairing red and green hues, because a red-green colour-blind reader cannot distinguish between them. The chart on the right encodes positive gaps in red, and negative ones in black. This choice of colours is confusing because of the convention, particularly popular in business, of using red ink for negative numbers.
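The grayscale inspection mentioned in rule d can be approximated numerically. The sketch below converts colours to a single gray value using the Rec. 601 luma weights (0.299, 0.587, 0.114); the choice of those weights and of the specific red and green (the CSS named colours) are assumptions made for this illustration. A grayscale check is only a rough proxy and does not simulate colour-vision deficiency.

```python
def luma(rgb):
    """Approximate gray level (0-255) of an (R, G, B) colour, Rec. 601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


# CSS named colours, used here purely as an example palette.
red = (255, 0, 0)     # "red",   #FF0000
green = (0, 128, 0)   # "green", #008000
black = (0, 0, 0)     # "black", #000000

red_green_gap = abs(luma(red) - luma(green))
red_black_gap = abs(luma(red) - luma(black))

print(f"red   -> gray {luma(red):.1f}")
print(f"green -> gray {luma(green):.1f}")
print(f"red vs green gap: {red_green_gap:.1f}")
print(f"red vs black gap: {red_black_gap:.1f}")
```

This particular red and green collapse to nearly the same gray, so the sign of the gap would be lost in a grayscale rendering; pairing hues that also differ in lightness is one way to keep the distinction.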
Conventions on guides
Chart designers add guides such as legends, axes, gridlines, and labels, with the express purpose of accelerating cognition. As Edward Tufte and other experts have pointed out, such guides sometimes backfire when poorly executed. In response, a large set of conventions and rules has been developed.
It goes without saying that axes have canonical directions. On the vertical axis, larger values are placed above smaller values, while on the horizontal, larger values are placed on the right of smaller values (except in RTL countries). Flouting these rules results in nonsensical charts.
In 2014, Reuters published the following line chart that promptly unleashed a tweet storm in the data visualisation community.
The Stand Your Ground law, which legalises using deadly force for self-protection, was widely expected to worsen gun violence, and yet this chart depicted a downward trend upon its enactment in 2005. Upon discovering the inversion of the vertical axis, readers realised that their intuition of ‘lower is less’ had been misplaced. Reactions were scathing. A college professor complained: "It is so deeply misleading that I loathe to expose your eyeballs to it". This tweet storm shows why designers should follow the conventions unless there is a compelling reason not to.
If a time dimension is involved, the convention is to place time on the horizontal axis. In a scatter plot, the outcome variable should be coded to the vertical axis, and the explanatory variable to the horizontal axis.
Two other unspoken rules -- on limits and tick marks -- inform the design of axes. Reasonable limits are chosen to remove excessive white space from the plotting surface. Tick marks should fall on easily interpretable increments and values; for example, the sequence [0, 20, 40, 60, ..., 120] instead of [2, 22, 42, 62, ..., 122], or worse, [2.3, 22.3, 42.3, 62.3, ..., 122.3].
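One common way to arrive at easily interpretable tick marks is to round the step up to 1, 2, or 5 times a power of ten and then snap the axis limits to multiples of that step. The Python sketch below illustrates that heuristic; it is an assumption of this example rather than a rule stated by any particular charting tool, and the function name, the `target_intervals` parameter, and the data range are hypothetical.

```python
import math


def nice_ticks(lo, hi, target_intervals=6):
    """Return tick values on round increments that cover [lo, hi].

    Assumes hi > lo. The step is the smallest of 1, 2, 5, or 10 times a
    power of ten that is at least the raw step (hi - lo) / target_intervals.
    """
    raw_step = (hi - lo) / target_intervals
    magnitude = 10 ** math.floor(math.log10(raw_step))
    step = next(m * magnitude for m in (1, 2, 5, 10) if m * magnitude >= raw_step)
    start = math.floor(lo / step) * step   # snap lower limit down
    end = math.ceil(hi / step) * step      # snap upper limit up
    count = round((end - start) / step)
    return [start + i * step for i in range(count + 1)]


# Hypothetical data range running from 2 to 118.
ticks = nice_ticks(2, 118)
print(ticks)
```

Starting the ticks at the raw data minimum instead would yield a sequence like 2, 22, 42, and so on, which is the pattern the rule warns against.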
The following two charts are identical, except for the axis labels. They both convey the message that Chinese tourists entering Australia have outnumbered those from New Zealand since 2017. The more precise labels in example two are harder to grasp.
Almost all charts include a legend. A colour legend is commonly found on line charts, bar charts, pie charts, bubble charts, and more. The first rule for legends is to not use a legend if direct labels are feasible.
On a line chart with a bundle of lines, it is usually preferable to place labels next to the lines, rather than inside a legend box. It goes without saying that the colours in the legend must correspond one-to-one to the colours on the chart itself, and that the order of appearance should mimic that on the chart. Popular software such as Excel frequently makes a mess of this rule, listing items inside the legend box in the reverse of the order in which they appear on the main chart.
In these graphs, featuring the eye-popping rise in Chinese tourists visiting Australia, the line labels follow the rank of the tourist counts in 2018, the most recent year with data. Example one, with direct labelling, reduces the amount of head-shaking to connect the legend key with the line.
For bar or column charts, if a legend box cannot be avoided, the convention is to place it above the chart below the titles, as readers rely on the information to interpret the graphic. The order of categories should mirror the orientation of the data.
The National Post portrayed results from a survey of attitudes toward immigrants in a series of paired column charts, one of which is reproduced in example three. These charts adopt the conventions of ordering the countries according to the proportion of respondents who agreed with each statement, with the colour legend placed on top below the chart title. By contrast, example four uses an alphabetical ordering of countries, and a right-sided legend, which significantly complicates cognition.
An emerging convention is to embed the legend into chart titles or subtitles. Applied to the Australian tourism data graphic, the coloured text in example five points the reader's eyes to the key countries of China and New Zealand. Example six requires a tad more effort from the reader to link up the chart title and the line labels.
How items are ordered on a chart has an outsized effect on the reader's comprehension. Despite software's predilection for the alphabetical scheme, it is rarely the right choice. Howard Wainer, the author of Visual Revelations and other books on data visualisation, derided this as "Alabama first!" (Alabama is the first state in alphabetical order.) Convention calls for using the natural order when it is available. Time variables, age groups, income groups, education levels, and so on all have natural orders.
Example one shows the relative popularity of crime movies across age groups in the United Kingdom, as compiled by researcher Stephen Follows. The age groups are presented in natural order, from the youngest to the oldest. Ordering by value, as seen in example two, does not work well with data that have a natural order, as the eyes jump around to re-establish the sequence.
When making a panel of plots, the rule is to retain the same order of values across all charts. The pair of plots in example three, adapted from the previously-cited study of attitudes towards immigration by The National Post, illustrates why switching the order of countries from chart to chart hinders the reader's ability to compare responses to the two survey questions.
We should lock the order of the countries throughout the panel, as shown in example four. Countries are laid out from left to right by the decreasing proportion of respondents who agreed with the first statement.
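Locking the order amounts to computing it once, from the first statement, and reusing it for every chart in the panel. The Python sketch below illustrates this with placeholder country names and made-up percentages; none of the figures come from the survey.

```python
# Hypothetical percentages agreeing with each statement; placeholder countries.
agree_statement_1 = {"Country A": 40, "Country B": 62, "Country C": 55, "Country D": 48}
agree_statement_2 = {"Country A": 51, "Country B": 35, "Country C": 58, "Country D": 44}

# Compute the order once: decreasing agreement with the first statement.
country_order = sorted(agree_statement_1, key=agree_statement_1.get, reverse=True)

# Reuse the same order for every chart in the panel.
panel_1 = [(country, agree_statement_1[country]) for country in country_order]
panel_2 = [(country, agree_statement_2[country]) for country in country_order]

print("Order:", country_order)
print("Statement 1:", panel_1)
print("Statement 2:", panel_2)
```

The second chart is no longer sorted by its own values, which is the price paid for letting the reader compare the same country at the same position in both charts.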
Text used sparingly complements the visual experience. Many authors recommend using informative chart titles. The designer must replace the default chart titles assigned by graphing software, typically formed from concatenating the axis titles. Another rule is to explain all acronyms and jargon. It is also conventional to include the source(s) of data in a footer.
When labelling data, the rule is to label items that are key parts of the story. Don't label everything. The labels, in effect, provide cues to readers as to the most significant items. Example one reproduces an earlier chart examining the effectiveness of healthcare spending, with the full set of country labels. Too many labels contend for the reader's attention.
When to ignore conventions
A convention arises when a plurality of practitioners agree on the wisdom of an element of design. Some rules have cognitive rationales supported by scientific experimentation, which appear to be neither sufficient nor necessary for their popularisation. Researchers Bill Cleveland and Robert Kosara have conducted some of these investigations. But almost every convention has exceptions. My advice is: think twice before you break a rule but don't think twice if you must.
I have come across many examples of charts in which one or another convention is justifiably discarded to improve understanding. Let me end this Long Read with an example in which rule-breaking pays dividends.
Imagine using example one to convey a message to American readers that the US dollar has been strengthening against the Euro since 2018. The visual impression of a trend line running down conflicts with the strengthening message. Because the exchange rate is expressed as the number of US dollars per one Euro, the lower this number, the stronger the US dollar. One band-aid to this visual challenge is to place annotations, as in example two.
Such annotation merely sets the goalpost for a puzzle that the reader must resolve. Why does a lower line represent a stronger US dollar?
In this situation, the designer may as well break the axis rule by inverting the vertical axis. Example three is identical to our second example, except for the axis inversion (and the consequent flipping of the labels).
Another way to achieve this effect is to invert the exchange rate ratio, expressing it as the number of Euros per one US dollar. This solution won't please the financial community who are accustomed to looking at the US-dollar-to-Euro ratio.
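The inversion of the ratio is simple arithmetic, sketched here in Python with hypothetical exchange rates (not actual market data). A falling series of US dollars per Euro becomes a rising series of Euros per US dollar, so the line moves in the same direction as the ‘strengthening’ message.

```python
# Hypothetical exchange rates in US dollars per one Euro, oldest first.
usd_per_eur = [1.25, 1.20, 1.15, 1.10]

# Inverting the ratio gives Euros per one US dollar.
eur_per_usd = [1 / rate for rate in usd_per_eur]

for before, after in zip(usd_per_eur, eur_per_usd):
    print(f"{before:.2f} USD per EUR  ->  {after:.3f} EUR per USD")
```

The axis-inversion alternative leaves the numbers untouched and only flips the direction of the vertical scale; in matplotlib, for example, that is what `Axes.invert_yaxis()` does.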
The visual medium excels at conveying a large amount of information in multiple dimensions efficiently. Such efficiency relies on a set of unspoken rules and conventions, shared implicitly between producers and consumers of data graphics. In this Long Read, we’ve reviewed a selection of major conventions covering both aesthetics and guides of charts. Designers of data visualisation can exploit these conventions to simplify their graphics, removing unnecessary explanations. Recognising these unspoken rules helps avoid unintended misunderstanding. As with all visual design, depending on your specific application and audience, it may occasionally be prudent to defy convention. Lastly, think twice before you break a rule, but don't think twice if you must.
Recap of conventions
Conventions on aesthetics
- Use a reasonable number of slices
- Aggregate minor categories into one ‘Other’ slice
- Order slices by size from largest to smallest
- Place the ‘Other’ slice at the end of the sequence, regardless of the order
- Position the first and largest slice against the upper vertical radius
- Arrange slices in a clockwise fashion
- Vary colours only if the colours are encoding data
- Start value axis at zero
- Place explanatory variable on horizontal axis
- Place outcome variable on vertical axis
- If adding a regression line, assign an outcome variable
- Plot time on the horizontal axis
- Time runs left to right
- If time intervals are uneven, tick marks should be uneven in the same way
- Limit the total number of colours
- Colour difference should reflect data difference
- Use certain colour pairs with care
- Make charts friendly to colour-blind readers
Conventions on guides
- Use canonical directions (i.e. larger values to the right of smaller values)
- Time goes on the horizontal axis
- Place an outcome variable on the vertical axis
- Choose limits to remove excessive white space
- Tick marks should fall on easily interpretable increments and values
- Use direct labels if feasible
- Colours in the legend should correspond one-to-one to the colours on the chart
- Colours in the legend should be presented in the same order as they appear on the chart
- Place legend on top below the title
- Embed legend into chart titles or subtitles
- Place values in the natural order when it is available
- Avoid the default alphabetical order unless it is justified by the context
- Retain the same order across all plots in a panel of charts
- Use informative chart titles
- Explain all acronyms and jargon
- Include the source of data in a footer
- Label only key items, not all items
The unspoken rules of visualisation (and when to break them)