Wrestling with dirty data

Name: DataJournalism.com
Price range: $

Finding flaws in data before they find you

14 February 2018

By MaryJo Webster

This article was migrated from datadrivenjournalism.net. It has been edited for DataJournalism.com, but some information may still be outdated.

You’ve got a great story idea. You found the data you think you’ll need. You requested it. And -- perhaps much to your surprise -- it arrived.

Now you’re itching to finally start asking the data some questions.

But I will bet my paycheque that you won’t be able to start the analysis straight away. More often than not data will show up on your doorstep with a problem -- or two, or three, or four. In the data journalism world, we call this ‘dirty data’.

In other words, something is wrong with the data in a way that makes it difficult, if not impossible, for you to ask your questions and get the right answers.

Usually it’s not bad enough to stop your great idea. But first you need to identify the problems.

Trust me: You want to find the flaws in your data before they find you. Because they will.

Unfortunately, there isn’t a definitive list of problems you might encounter because each dataset is unique. But there are some all-too-common issues that I’m going to highlight: inconsistent values, missing data, structure troubles, and what I like to call ‘the agency screwed up’ and ‘it’s too good to be true’ problems.

I’ve collected stories from some data journalist colleagues, in addition to my own tales of woe, to help shed light on the wrestling match you’ll likely have with your data.

Inconsistencies

The term ‘dirty data’ is often associated with the most common problem you might encounter -- values in your data that are inconsistent.

The classic example is campaign finance contribution data that have separate records for each cheque written, and the names of the donors are inconsistent, even though the money might have come from the same person. For example, you may have five records with names listed as: John Smith, John and Sally Smith, John J. Smith, John J Smith (no full stop), and ‘Smith, John’.

Through reporting you’ve confirmed this is the same guy. But no matter which computer software you use to tally up his total contributions, the computer will think those are five different people.

This poses a problem because it makes it impossible for you to summarise the data to find out who contributed the most money to a campaign.

MARYJO-2 — **Example**

Problems in campaign finance data, where names are different from one record to another.

Variations of other values -- or just outright misspellings -- can show up almost anywhere. I had another dataset that had at least 20 different spellings of Minneapolis. I really had no idea you could come up with that many variations. Another big city nearby might show up in data as ‘St. Paul” (with a full stop)’, ‘St Paul’ (without a full stop) or ‘Saint Paul’. Occasionally I’ve also found ‘St. Pual’, of course.

Fields that have codes in them are supposed to be consistent (that’s the whole point of using codes!), but you can’t count on it. For example, you might have a field identifying a person’s gender. The documentation says this should either be ‘M’ or ‘F’. But in the data you have ‘M’, ‘F’, ‘O’, ‘N’, and lots of rows where this cell is blank. In my experience, the bad codes are either typos (most of the time), or relics from old coding systems that didn’t get translated into a new coding system. There could be all kinds of different reasons for the blanks (also known as NULL values).

Or, you might have a field indicating a unit of measurement -- such as a currency value for a salary rate -- but maybe for one person it says ‘10’ and another it says ‘2,000’, without telling you the unit of measurement. Is this an hourly rate or a bi-weekly or an annual rate? And having it all in the same field makes it impossible to do any calculations, such as an average or median.

Outlier values are also red flags. Perhaps you have a date field and you expect the values to be somewhere in the 21st century, but a couple records show dates in the early 1900s. Are those bad dates that are supposed to be in the 2000s?

Finding inconsistencies is easy. Just do a summary of each field or column using a Pivot Table or a group by query, or any other tool you would use to summarise data. You’ll quickly see the problems. Tracking down what the correct values should be is a bit harder, and requires some traditional reporting.

MARYJO-4 — **Cleaning data in spreadsheets**

Pivot Tables are powerful tools for identifying problems. And they aren't the only spreadsheet function that can help you clean dirty data. For more, see Brant Houston's article Spreadsheets for journalism, or our video courses Doing Journalism with Data: First Steps, Skills and Tools and Cleaning Data in Excel.

Missing data

Too often, you’ll discover that an agency doesn’t track what you want or doesn’t do a good job of tracking it.

Many years ago, I got a database of registered underground storage tanks in Kansas, including data on which ones had sprung a leak. The documentation showed there was a field indicating what had caused the leak and I was super excited because this was the focus of my work. But when I got the data, only a fraction of the records had values in this field. The agency told me they weren’t very good at filling in all the details. The field proved useless.

A few years ago, a colleague wanted to find out why foster parents were having their licenses suspended or revoked. Were they abusing the kids? Or neglecting them? Did they commit a crime? So, she requested Minnesota’s database of licence revocations. Turns out the state didn’t put the reason for the revocation into the database -- it was only stored in paper files.

For another project, I got data on police pursuits because we wanted to know how often bystanders -- such as pedestrians or other drivers not involved in the pursuit -- were injured or killed. A state law required every law enforcement agency to report details about each high-speed pursuit to a state agency. Our newspaper had written about many of these, so we pulled up the old stories and checked to see if those incidents were in the database. Turns out some of them weren’t. We also found cases where someone had died but the data didn’t indicate that. We learned that sometimes this was the fault of the state agency for not entering the information into the database, and in other cases it was because law enforcement agencies did not submit anything.

Structure problems

The main way to know you’ve got a structure problem is if you can’t figure out how to set up a Pivot Table or run a query that will get the answer you want. Your data table maybe has rows that should be columns. Or there are multiple rows for each thing that you want to count. Or it’s missing something you need.

For example, you may want to know which neighbourhood in the city had the most reported crime. But the crime data you have only has the address, not the name of the neighbourhood. This is an example of where you need to ‘categorise’ the data by creating a new field to assign each record a category (in this case, a neighbourhood name).

Another example: in 2017, I worked on a project about police officers who had been convicted of criminal offences. The court conviction data came to us with one record for each charge. For instance, if an officer was charged with driving under the influence and reckless driving for the same incident, there would be two records even if one of the charges was later dismissed.

We wanted to count up how many convictions there had been -- and by this, we really meant incidents or cases, not charges. I realised that this counting would be much easier if the data table had one record for each case, so I spent some time re-arranging to make a new table. The table had one record that identified the number of charges and the highest level of conviction (such as felony, gross misdemeanour, or misdemeanour), plus some other pertinent details.

MARYJO-6 — A snapshot of the data, showing multiple records per case.

I could write a whole separate article just about structure problems. The bottom line is that it happens often and is something you should assess at the outset.

Skewed data

These are situations where flaws end up in the data because of the data collection or inputting process by the agency.

If you’re lucky, the data keepers might tip you off. If you’re unlucky, you won’t notice it until you end up with strange results in your analysis. To be safe, the best thing to do is to ask a lot of questions about how the data ends up in the database.

Janet Roberts, head of the Reuters data team, told me this tale about data she worked with while at the New York Times. For a project about the influence of drug company payola, Janet requested Minnesota Medicaid prescriptions data. They hoped to identify the top doctors who prescribed antipsychotic medications to children and compare that list to another database listing drug company payments to doctors.

After they’d had the data for a while, a flack for the Division of Human Resources sent an email: “By the way”, she wrote, “you should know that we estimate that 20 percent of the prescriber names are inaccurate”. When Janet asked how they arrived at that figure, they were unable to provide any valid methodology.

Janet started examining the data and found really strange things; for example, an ophthalmologist in northern Minnesota prescribing antipsychotics to a kid who was hundreds of miles away in Minneapolis, and the same patient getting prescriptions for the same drug from different doctors in very erratic patterns. Janet started inquiring how the data were collected.

The prescriber information was entered at the pharmacy level, most likely by the clerks who take your prescription. They got the name of the physician from whatever was scribbled on the signature line. You can imagine how many times they got it wrong. Janet said they were unable to quantify the level of error in that crucial field, so they had to abandon the idea of identifying the top prescribers.

John Perry, from the Atlanta Journal Constitution, told me about using crime data in Oklahoma City for a joint project between the city’s newspaper and a local TV station. The plan was to identify the ‘rape hotspots’ in the city. Turns out, the two big hotspots were at the downtown police station and the University of Oklahoma Health Science Center. In other words, the location field in the data was sometimes where the cops were standing when taking the rape report -- not where the crime occurred.

Tom Torok, a now-retired data journalist, told me a story that he used to warn students about the validity of data. A friend of his worked for a computer firm that was hit with a lawsuit alleging racial discrimination in hiring and promotions. The company used its personnel database to demonstrate that it was hiring and promoting many more blacks than it was required to; the lawsuit was dismissed. When the company honchos announced the lawsuit victory at a board meeting, the secretary responsible for inputting the data turned ashen. When asked what was wrong, she explained that when inputting the data, the race field was a required field but there was nothing on the company paperwork to indicate race. So, she said, she looked at the name and guessed the race.

Agency screw-ups

Sometimes you just get bad data handed to you.

My worst nightmare came true when working with data from the St. Paul Public School District. A colleague and I had requested all the payments they had made -- essentially the equivalent of their cheque book register. We thought it would be fairly straightforward, and when we ran into questions or problems we did exactly what we were supposed to do -- asked the school district for help.

The first red flag was that some records appeared to be duplicates -- hundreds of them -- but weren’t exactly the same. The recipient would be the same; the date was the same; the description of the expenditure was the same; the codes for which pot of revenue it came from were the same. But one cheque would say something like $5,000, and the other would be twice that amount.

We asked the school district to explain this and they said it was all legitimate. They assured us there weren’t any problems. We should not have listened to them. Turns out they included voided cheques, but they didn’t include a field indicating which records were voided checks and which were not. To make matters worse, the way they handled voided checks was to simply double the dollar amount on the voided record. So, when we tallied up totals, we were way off.

Another schools story by a Star Tribune colleague narrowly avoided disaster. The reporter saved the day because she kept asking questions. Her data showed that a Minneapolis school had dropped its chronic absenteeism rate astonishingly fast.

At first, she and her editor thought this would be a great feature story about this school solving a big problem. Her editor asked her to include a chart with the story showing how this school district compared to others in the area. Problem was that the data she had was only for the Minneapolis school district. Data for the other districts was at the state education agency, and it turns out the two datasets used different methods for measuring chronic absenteeism.

MARYJO-10 — The data provided from the school district, showing that the chronic absenteeism rate dropped from nearly 25% in 2014 to just over 5% in 2016.

And then the state’s data for that standout school showed they had not dropped their chronic absenteeism rate dramatically.

The reporter started asking both the school district and the state agency to clarify. She got stonewalled for weeks. Nobody had a good answer -- they just kept saying the two measurements were different because they used different definitions.

Still, that astonishingly fast drop in the rate seemed ‘too good to be true’ to the reporter. Turns out it was. Just a few days before publication, the school district finally dug into their own data and discovered the numbers for that one school were incorrect.

The common thread

The common threat you can take away from all these tales is that you really need to get to know your data and try to do as much of that research as possible before making a request.

How does the information end up in the database? What fields did the agency transfer, or not transfer, to you? How is each field used? What codes are, or are not, supposed to be in there?

Then you need to spend time looking for inconsistencies or missing data that you expected to be there. If possible, cross check your data against paper records or other sources. Aggregate your data to try to make it match published reports.

Finally, it pays to be skeptical. The method of last resort is always watching out for the ‘it’s too good to be true’ results.

For other potential problems and how to solve them, check out the Quartz Guide to Bad Data.

Wrestling with dirty data - Finding flaws in data before they find you

12 min Click to comment

Longform reads

Verification Handbook

Data Journalism Handbook 2

New course

Quality journalism

Countering hate speech

New course

Video course

Fundamental search for journalists

Popular course

Coding

Python for journalists