Ensuring Accuracy in Data Journalism
#Data Smells
It is often said of data journalists that they interview data just like traditional reporters interview traditional sources. A corollary is that data must be subject to the same scrutiny traditional sources are. Not all sources can be readily trusted and neither can all data sets. But how do data journalists go about determining whether or not a data set demands closer inspection?
Inspired by the idea of code smells, we take on the task of developing a list of "data smells", that is, red flags in data sets that may indicate the data is not reliable for journalistic purposes. The term "data smells" was suggested by The New York Times' Jacob Harris at the 2014 NICAR/IRE conference in Baltimore. This is not by any means an exhaustive list; feedback and contributions are welcome.
###Data Smells in Dirty Data
When data is unreliable or untrustworthy, we say data is dirty. Now, what exactly makes data dirty?
"Broadly, dirty data include missing data, wrong data, and non-standard representations of the same data. We say that data is dirty if the user or application ends up with a wrong result or is not able to derive a result due to certain inherent problems with the data." (Kim et al., 2003)
At this point, we make a distinction between two types of wrong data: unintentionally inaccurate data and doctored or fabricated data. For now, we will take "wrong" in the context of dirty data to mean "unintentionally inaccurate". We treat fabricated data as a separate case, which we will look at later. It is trickier to detect fabricated data than dirty data.
Kim et al. published a taxonomy of dirty data in the journal Data Mining and Knowledge Discovery. In order to study dirty data in the context of journalism, we'll adopt their taxonomy and use it as a baseline to list different types of data smells, providing journalistic examples to illustrate each type. For the purposes of this project, we will focus on numerical and string data (and exclude multimedia data such as audio and video, which require further considerations).
Based on Kim's taxonomy, we divide data smells into three broad categories:
- missing data
- wrong data
- unusable data
We'll look at each case closely, but before we do so, it is worth going over a concept that is fundamental to our discussion: the frequency table.
A frequency table of a given variable is simply a table that displays the frequency of each value of that variable, i.e. a table that shows how many times each value occurs in the data. Sometimes it also displays the percentage frequency of each value.
For example, consider the following data set of FIFA World Cup editions and hosts:
| YEAR | HOST |
|---|---|
| 1930 | Uruguay |
| 1934 | Italy |
| 1938 | France |
| 1950 | Brazil |
| 1954 | Switzerland |
| 1958 | Sweden |
| 1962 | Chile |
| 1966 | England |
| 1970 | Mexico |
| 1974 | West Germany |
| 1978 | Argentina |
| 1982 | Spain |
| 1986 | Mexico |
| 1990 | Italy |
| 1994 | United States |
| 1998 | France |
| 2002 | South Korea and Japan |
| 2006 | Germany |
| 2010 | South Africa |
| 2014 | Brazil |
| 2018 | Russia |
| 2022 | Qatar |
A frequency table of the variable 'Host' simply shows a count of how many times each country listed has hosted the FIFA World Cup plus the percentage frequency of each country as host. The frequency table is as follows:
| HOST | Frequency | Frequency(%) |
|---|---|---|
| Argentina | 1 | 4.545% |
| Brazil | 2 | 9.091% |
| Chile | 1 | 4.545% |
| England | 1 | 4.545% |
| France | 2 | 9.091% |
| Germany | 1 | 4.545% |
| Italy | 2 | 9.091% |
| Mexico | 2 | 9.091% |
| Qatar | 1 | 4.545% |
| Russia | 1 | 4.545% |
| South Africa | 1 | 4.545% |
| South Korea and Japan | 1 | 4.545% |
| Spain | 1 | 4.545% |
| Sweden | 1 | 4.545% |
| Switzerland | 1 | 4.545% |
| United States | 1 | 4.545% |
| Uruguay | 1 | 4.545% |
| West Germany | 1 | 4.545% |
Frequency tables can be constructed in Excel or Google Refine. For more on frequency tables and how to construct them, please see the Tools section below.
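For those who prefer scripting over spreadsheets, a frequency table can also be computed in a few lines of Python. This is a minimal sketch; the list of hosts is a hypothetical, shortened sample of the 'Host' column above:

```python
from collections import Counter

# Hypothetical, shortened sample of the 'Host' column from the FIFA data set.
hosts = ["Uruguay", "Italy", "France", "Brazil", "Italy", "France", "Brazil"]

def frequency_table(values):
    """Map each value to (count, percentage frequency)."""
    counts = Counter(values)
    total = len(values)
    return {v: (c, round(100 * c / total, 3)) for v, c in counts.items()}

table = frequency_table(hosts)
print(table["Brazil"])  # (2, 28.571)
```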
We now move on to look at the three broad categories of data smells: missing data, wrong data and unusable data.
####1. Missing Data
Oftentimes data sets contain missing entries, for instance because the specific data point is unknown. Now, if missing data is allowed, all is fine, but if missing data is not allowed or not expected, the data is considered dirty.
Data Smell #1: Blank cells in a spreadsheet
The simplest case of missing data is the mere presence of blank cells in a spreadsheet. For instance, imagine a data set of campaign donations in which a given entry contains the donation amount but not the name (or some identifier) of the donor.
The fact that a data set is incomplete when it shouldn't be will often become a story in itself, such as the following story published in the UK: London councils lack data on children missing from care, finds report. London boroughs are legally responsible for the care and protection of a number of children, but some boroughs lack data on the number of children missing from care. The incomplete data set is available here.
For a small data set, detecting the presence of blank cells is as simple as opening the document and visually inspecting it in search of blank cells. The larger the data set, the more difficult it is to detect blank cells visually. Fortunately, frequency tables come to the rescue: this task can be automated in Excel or Google Refine. One simply needs to check the frequency of 'blank' values. If 'blank' occurs at all in the data set, it might be worth investigating why that is. The question of whether or not data is allowed or expected to be missing often comes down to a judgement call and may require further reporting to be answered.
Note that there is a distinction between 'blank' and 'empty' values. When checking for blank entries, be careful with whitespace and tab characters, which look 'blank' but are not 'empty'. This scenario might occur when the protocol for manually entering data into a system does not allow for missing values: when faced with missing data, the person inputting the data enters an invisible character such as a tab or a whitespace to bypass the constraint. No search function in Excel, Google Refine (or really any programming language) will classify these entries as empty because they actually contain a character (albeit an invisible one). In short, make sure to look at the frequency of whitespace and tab characters as well as the frequency of 'empty' cells.
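A quick Python sketch of this check, using a hypothetical donor-name column, counts truly empty cells separately from whitespace-only cells:

```python
# Hypothetical 'donor name' column: one truly empty cell, and two cells
# that only *look* blank (whitespace and tab characters).
column = ["Jane Doe", "", "   ", "\t", "John Roe"]

# Truly empty cells contain no characters at all.
empty = sum(1 for cell in column if cell == "")

# Whitespace-only cells look blank but are not empty.
looks_blank = sum(1 for cell in column if cell != "" and cell.strip() == "")

print(empty, looks_blank)  # 1 2
```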
Data Smell #2: Gaps in Data
A slightly more evolved case of missing data is when the data set does not contain 'blank' entries but data is missing anyway, i.e., there is a gap in the data.
For example, a data set of illiteracy rates across the U.S. might contain data for only 49 states when there should be data for all 50 states. If the 'states' column lists all 50 states but the 'rates' column contains only 49 values, then this is data smell #1: an instance of blank cells in a spreadsheet. However, if the 'states' and 'rates' columns both contain 49 values, it wouldn't be evident from the analysis of the frequency table that data is missing.
Back to the FIFA World Cup example, note that the table showing the host country for each edition has an entry for every four years since the World Cup's first edition in 1930, except for the years 1942 and 1946. This is an example of a gap in data. Further investigation shows that the FIFA World Cup did not take place in 1942 and 1946 due to World War II. This example shows that a gap in data is not necessarily a mistake, but it is something that should nonetheless always be looked into.
Besides constructing a frequency table, one can detect this smell by comparing the data set against another data set of the same scope ('national', in the example above) which is known to be complete. This can be done by joining the two data sets. For more on joins, see Tools section below.
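In Python, this comparison can be as simple as a set difference. The sketch below checks the FIFA years above against the full expected schedule (an edition every four years from 1930 to 2022):

```python
# Years present in the example data set above.
observed_years = {1930, 1934, 1938, 1950, 1954, 1958, 1962, 1966, 1970,
                  1974, 1978, 1982, 1986, 1990, 1994, 1998, 2002, 2006,
                  2010, 2014, 2018, 2022}

# The complete reference set: one edition every four years since 1930.
expected_years = set(range(1930, 2023, 4))

# Years expected but absent are gaps in the data.
gaps = sorted(expected_years - observed_years)
print(gaps)  # [1942, 1946] -- the editions cancelled due to World War II
```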
Time series plots are particularly useful for detecting gaps in temporal data. The following story by the Wall Street Journal is based on a data set that contains gaps: Five Revelations From the U.S. Embassy’s Beijing Pollution Data Dump. A footnote on the graph alerts readers to the fact that no data is available for Nov. 7, 2008 - Feb. 16, 2009. Note how the lack of data for these dates stands out in the time series plot.
It might be the case that the remainder of the data set is accurate, in which case we don't want a few instances of missing data to keep us from telling a story. There are a few possible ways to handle the lack of information for a given entry:
- Simply ignore incomplete rows.
- Perform interpolation to approximate the value of the missing data.
- Call up the institution responsible for the data set and ask for the missing information, or for the reason why it's missing.
Be aware that these might not apply in a given situation. In the case of data on missing children in London, for example, it wouldn't make sense to approximate data for one neighborhood by interpolating data from others, since an exact count of children is the goal.
It did make sense to interpolate data in the following story published by The Guardian's Data Blog: Inequality 'worst since second world war'. In this case, missing data on the income share by portion of the population was estimated by interpolating neighboring values, because these data rarely undergo sharp variations from one year to the next.
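As a rough sketch of the interpolation strategy, the hypothetical Python function below fills a single missing value (represented as `None`) with the mean of its neighbors. This is only a reasonable assumption when, as in the income-share story above, the series varies smoothly:

```python
def interpolate_gaps(series):
    """Fill single None gaps with the mean of the neighboring values."""
    filled = list(series)
    for i, v in enumerate(filled):
        if v is None and 0 < i < len(filled) - 1:
            prev, nxt = filled[i - 1], filled[i + 1]
            if prev is not None and nxt is not None:
                filled[i] = (prev + nxt) / 2
    return filled

# Hypothetical yearly income-share values with one missing year.
print(interpolate_gaps([31.0, None, 33.0]))  # [31.0, 32.0, 33.0]
```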
####2. Wrong Data
The following is a fairly comprehensive list of instances where data might be wrong. As a reminder, we take wrong data to mean unintentionally inaccurate (as opposed to doctored).
Data Smell #3: Wrong Type
A column might contain values of one data type when a different type is expected. For example, the 'age' value for some data entry might be a string (such as 'abc') when the values are expected to be integers (such as '12' or '3').
In broad terms, data types are either numeric (such as integers) or string, as The Tow Center For Digital Journalism explains in a handy blog post on Code Vocabulary. Poynter also has a nice tutorial on how journalists can use Excel to advance their reporting.
Frequency tables will readily pick out these inconsistencies: all you need to do is search the table for values that are not of the expected type. In our FIFA example data set, for instance, the two rows below would fit the bill. In the first row, attribute 'year' has a value of type string where a numeric value is expected. In the second row, attribute 'host' has a value of type numeric where a string value is expected.
| YEAR | HOST |
|---|---|
| ABCD | Uruguay |
| 1934 | 99999 |
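A small Python sketch of this check, using hypothetical rows modeled on the table above, flags entries whose 'year' cannot be parsed as an integer:

```python
# Hypothetical rows modeled on the FIFA example, including one bad 'year'.
rows = [{"year": "1930", "host": "Uruguay"},
        {"year": "ABCD", "host": "Uruguay"},
        {"year": "1934", "host": "99999"}]

def wrong_type_rows(rows, field):
    """Return the rows whose field cannot be parsed as an integer."""
    bad = []
    for row in rows:
        try:
            int(row[field])
        except ValueError:
            bad.append(row)
    return bad

print(wrong_type_rows(rows, "year"))  # the row with year 'ABCD'
```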
Data Smell #4: Outliers
Outliers are data points that significantly deviate in some way from the expected or observed trend of the entire data set. The data smell related to outliers refers to the case when data values fall outside the expected range. An example would be a data point with a value of '225' for a person's age, which one might expect to range from '0' to about '115'. Such values are usually due to errors in the data entry process. An example of an outlier in string data would be a value of 'D' for a person's gender, when 'M' or 'F' are the expected values.
Outliers occur due to either an error in data collection or an interesting, perhaps newsworthy event. For instance, '225' in the example above can very well result from a misspelling of '25'. Likewise, data value 'D' for gender could be a typing mistake, seeing as letters 'D' and 'F' are side by side on the keyboard. An example of an interesting outlier event is this story by The Chicago Tribune, which shows that a small group of healthcare professionals billed the federal government at the highest rate for more than 70% of their Medicare visits, deviating significantly from the overall trend.
Specifying the expected range of data values is often a difficult task which requires further reporting. As a result, it is often difficult to determine whether outliers are erroneous data. Possible remedies are ignoring entries whose values fall outside the expected range or contacting the publisher of the data set to ask for clarification and/or correction.
Whether outliers stem from errors or interesting events, they merit close inspection. If there is an error in data, it is paramount to remedy the error before going further with the data analysis. On the other hand, if the data point has been accurately measured, its significant deviation from the overall data trend will often be a story worth reporting on.
Unsurprisingly, identifying outliers and reporting on their causes or consequences is a classic approach to data journalism. An outstanding piece of journalism that resulted from the investigation of outliers in a data set is this story by The Palm Beach Post's Lawrence Mower, which uncovered a lottery fraud scheme. Mower analyzed a data set of Florida's lottery winners from 1997 to 2013 and observed that some store clerks and owners had cashed in hundreds or even thousands of high-value winning tickets, while the vast majority of winners had claimed a prize only once. The reporter eventually found out that store clerks and owners would tell winning customers that they had not won and would then cash the prize on their behalf. This is a great example of data set outliers leading to the uncovering of serious wrongdoing. The Columbia Journalism Review published an article dissecting Mower's reporting process.
Frequency tables are again useful to detect outliers because they list all values that occur in a given column and as a side bonus, provide the range of values for each column. One might detect these smells by examining simple statistics such as maximum, minimum, quartiles and percentiles for each column. Please see the Tools section below for pointers to tutorials on how to use these functions in Excel.
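As a sketch, the Python snippet below applies a simple range check to hypothetical age values and prints the summary statistics (minimum, maximum, quartiles) that make outliers stand out:

```python
import statistics

# Hypothetical ages, including the mistyped '225' from the example above.
ages = [25, 31, 47, 225, 52, 38]

# Flag values outside the expected range for a person's age.
out_of_range = [a for a in ages if not 0 <= a <= 115]
print(out_of_range)  # [225] -- likely a mistyped '25'

# Simple summary statistics also make outliers stand out quickly.
print(min(ages), max(ages), statistics.quantiles(ages, n=4))
```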
Data visualizations such as histograms or scatter plots will also allow you to detect outliers. The graph below plots gun ownership (x-axis) versus homicide rates (y-axis) by country. The data was published in 2012 by The Guardian Data Blog. Note how Honduras and the United States are clear outliers in the plot: the former has a much higher homicide rate than all other countries, while the latter has a much higher gun ownership rate than all other countries.

Data Smell #5: Sharp Curves
Abrupt shifts in temporal or spatial data are usually an interesting finding which calls for close examination. Just like outliers, these shifts might either be a story worth reporting on or the result of some error in data collection. Either way, they are worth investigating. The best way to spot sharp curves in the case of temporal data is by constructing Edward Tufte's slope graph or a time series plot. In the case of spatial data, color-coded maps are the way to go.
This article by The Wire on the 2014 press freedom ranking by country provides examples of both cases. In terms of temporal data, the sharp drop seen by the U.S., which fell 14 spots from 2013 to 2014, did not go unnoticed upon the release of the 2014 ranking. By comparing the rankings over time (i.e. in a time series), reporters were able to spot the sharp fall of the U.S. in the ranking.
As for spatial data, note how the color-coded map illustrating the article makes it easy to see that Namibia stands out positively among its neighboring countries while Sudan stands out negatively in Africa. The contrast between the color codes is a visual representation of the sharp rises and drops between the indexes of neighboring countries.
The following smells are 'internal consistency checks'. Internal consistency is the idea that no two data points within a data set, and no two data sets within a database, contradict each other. The smells below refer to instances of violation of this internal consistency standard.
Data Smell #6: Conflicting information within a data set
This data smell arises when cross-field validation is not observed. Cross-field validation refers to the idea that conditions based on multiple data points must hold. For example, all percentage frequencies in the FIFA data set must add up to 100%. Another example would be the idea that the start date of some event must be earlier than the end date of the same event.
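A minimal Python sketch of both cross-field checks, using hypothetical values (the frequencies mirror the FIFA table above, and the dates are those of a hypothetical month-long event):

```python
from datetime import date

# The 18 percentage frequencies from the FIFA table: 14 single hosts,
# 4 double hosts. They should sum to 100%, up to rounding error.
frequencies = [4.545] * 14 + [9.091] * 4
total = sum(frequencies)
print(abs(total - 100.0) < 0.01)  # True: within rounding error

# A start date must precede the end date of the same event.
event = {"start": date(2014, 6, 12), "end": date(2014, 7, 13)}
print(event["start"] < event["end"])  # True
```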
A common instance of conflicting information within a data set occurs when there are repeated values for an attribute that expects unique values only. For example, Social Security numbers are expected to be unique, yet duplicates do occur in practice, as NBC News has reported: http://www.nbcnews.com/business/consumer/odds-someone-else-has-your-ssn-one-7-f6C10406347. Other instances of this smell include:
- two entries for the same employee
- percentage frequencies that don't add up to 100%
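In Python, repeated values in a field that is expected to be unique can be flagged with a counter; the Social Security numbers below are made-up examples:

```python
from collections import Counter

# Hypothetical SSN column; SSNs are expected to be unique.
ssns = ["078-05-1120", "219-09-9999", "078-05-1120"]

# Any value occurring more than once violates the uniqueness expectation.
duplicates = [ssn for ssn, count in Counter(ssns).items() if count > 1]
print(duplicates)  # ['078-05-1120']
```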
Data Smell #7: Conflicting information across multiple data sets
Much of data journalism is about comparing data sets and drawing interesting conclusions from the comparison. However, it might be the case that two data sets contain contradictory information, in which case the reporter must closely examine the contradiction.
For example, a reporter might find that a company's total number of employees (listed, say, in a directory of employees) is lower than the sum of the number of employees by department. Perhaps this is because some employees work in multiple departments, the reporter might find after some investigation. Perhaps there is an error in some data entry. Either way, this is a red flag warning the reporter about the need for further investigation before the actual data analysis.
The Tampa Bay Times' Matt Waite wrote an informative piece titled Handling Data About Race and Ethnicity on Source describing his misfortune in ignoring this data smell. Waite explains that 'Hispanic' was recorded as a race in Florida's voter registration database and as an ethnicity in the state's criminal history database, which meant that there were almost no matches between the two databases when Hispanic (as a race) was used as a criterion for matching, even though both databases contained many entries with the value 'Hispanic'. In short, the two data sets provided conflicting information.
A special case of data smell #7 is a dangling reference across data sets, which takes place when data in one data set has no counterpart in another. For example, 'Jonathan Stray' might be listed as instructor for the 'Computational Journalism' course in an instructor/course table and at the same time not be registered as an 'instructor' (but as a 'fellow') in a table containing the university's affiliates.
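A sketch of this check in Python, with hypothetical tables: every instructor referenced in the course table should also appear among the affiliates registered as instructors.

```python
# Hypothetical course table mapping course name to instructor.
courses = {"Computational Journalism": "Jonathan Stray"}

# Hypothetical set of affiliates registered as instructors.
instructors = {"Jane Doe", "John Roe"}

# Names referenced in one table but absent from the other are dangling.
dangling = {name for name in courses.values() if name not in instructors}
print(dangling)  # {'Jonathan Stray'}
```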
There is no straightforward way to handle these smells. Possible strategies are to try to decide which data point is more reliable (by picking more recent over older data, for example) or to fact check the conflicting data points (by calling up the institution responsible for publishing the data, for example.)
Data Smell #8: Entry in Wrong Fields
This smell arises when a value is entered in the wrong field, e.g., an address in the name field.
Data Smell #9: Wrong Spatial Data
- a spatial constraint is violated; e.g., an incomplete shape
- when working with GIS data, for example, always plot the data over a map and make sure the geography is correct
Data Smell #10: Wrong Derived-Data Field
- be careful when manipulating data to create new data based on the original
- this smell is due to an error in the functions used to compute a derived field, such as an average. A famous example is the spreadsheet error in the influential Reinhart-Rogoff economics paper: http://retractionwatch.com/2013/04/18/influential-reinhart-rogoff-economics-paper-suffers-database-error/
Data Smell #11: External Inconsistency
External inconsistency is, simply put, the case when the numbers don't add up: given the context surrounding the data generation process, the measured values need to be reasonable when compared against values outside the data set. For example, the total number of employees in the employee table might not match the number of employees in the department table, or fishing statistics for a given country might add up to more than the total fish in the ocean.
Again, the Hispanic race/ethnicity example from Data Smell #7 applies here.
//Find Chinese fishing article
http://articles.latimes.com/2009/jul/09/local/me-lapd-crimemap9 how did they find out? -> Ben Walsh Newsday - checked absence of legislation by doc analysis, then survey http://ire.org/accounts/login/?next=/blog/uplink/2014/03/26/database-key-proving-lax-police-misconduct-oversig/
A word of warning about other problems that can't be easily detected: these come down to a judgement call about the contents of the data, a task that cannot be easily automated. Strictly speaking, they are not even smells, because they are not red flags that can be easily detected.
Erroneous entry
- e.g., an age mistyped as 26 instead of 25
- this is very similar to the outlier smell above, but it can't be detected unless the value falls outside the expected range
Misspelling
- e.g., principle instead of principal, effect instead of affect
- a text editor will sometimes, but not always, detect these
=> How to distinguish between these and fabricated data? You can't. This is fabricated data, in a way.
We move on to look at data smells that are particular to dynamic data, that is, data that is updated over time. These smells do not apply when only static data is used, that is, when there is no need to handle updates to the data set. In the context of journalism, these smells might arise in the development of interactive interfaces that rely on frequent calls to APIs. For example, a journalist embedding tweets in a story needs to consider the possibility that the user might delete a tweet.
Data Smell #12: Lost Update
- violating a temporal valid-time constraint; e.g., a person’s age or salary not having been updated
- specify the range for which data is valid
This refers to the case when more than one transaction reads and updates the same data simultaneously. // Example that we might run across? How else can lost updates happen?
Data Smell #13: Unrepeatable Read
Twitter example: a journalist embedding tweets in a story needs to consider the possibility that the user might later delete a tweet. Different versions of the data are another case, e.g. updates to economic statistics. http://pandaproject.net/ centralization and version control
####3. Unusable Data
Unusable data is not wrong per se, but cannot be used as-is, mostly due to a lack of consistency.
Data Smell #14: Non-Standardized Data
- list types of lack of standardization
// regular expression
The Data Journalism Handbook (http://datajournalismhandbook.org/1.0/en/understanding_data_2.html) gives a good example:
> "Take “Gender”, an easy example. You may discover that your Gender field includes any of a mix of values like these: Male, Female, M, F, 1, 0, MALE, FEMALE, etc., including misspellings like ‘Femal’. To do a proper gender analysis, you must standardise — decide on M and F, perhaps — and then change all the variations to match the standards. Another common database with these kinds of problems are American campaign finance records, where the Occupation field might list “Lawyer”, “Attorney”, “Atty”, “Counsel”, “Trial Lawyer” and any of a wealth of variations and misspellings; again, the trick is to standardise the occupation titles into a shorter list of possibilities."
Labels should be consistent across the data. There shouldn't be more than one possible label to describe a given entry. For example, when describing the types of products sold at a supermarket, the data set shouldn't describe one brand of pizza as "food" and another one as "frozen food".
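A Python sketch of label standardization along the lines of the handbook excerpt above. The mapping here is entirely hypothetical; in particular, whether '1' and '0' mean male or female must be confirmed against the data set's documentation before standardizing:

```python
# Hypothetical mapping from observed variants to standard labels.
STANDARD = {"male": "M", "m": "M", "1": "M",
            "female": "F", "f": "F", "0": "F", "femal": "F"}

raw = ["Male", "FEMALE", "M", "f", "Femal", "1"]

# Normalize case and whitespace, then map to the standard label.
clean = [STANDARD.get(value.strip().lower(), "UNKNOWN") for value in raw]
print(clean)  # ['M', 'F', 'M', 'F', 'F', 'M']
```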
Extraneous data added to entries usually makes data analysis more difficult. For example, mixing the use of name and title (such as CJ. John Roberts Jr.) with the use of just the name (John Roberts Jr.) in a data set about Supreme Court justices makes it difficult to cluster entries about the Chief Justice.
In the FIFA data set example, one could have added the annotation "(expected)" next to Brazil, Russia, Qatar in the rows corresponding to 2014, 2018 and 2022, since these events are yet to take place, but this would make it more difficult to cluster the events of 1950 and 2014 as having been hosted by the same country - Brazil - since the labels used in each year are different ("Brazil" and "Brazil (expected)", respectively).
Likewise, units should be consistent across the data. For example, a given data set shouldn't mix metric units (such as m and km) with imperial units (such as in and ft). It shouldn't even mix different metric units, such as km and m, because mixed units undermine the task of comparing data entries. If different units do appear, it is definitely a good idea to adopt a standard and enforce it across the entire data set.
If there is inconsistency in the labels used, it might be a good idea to make them consistent before analyzing data. Google Refine is one tool with which this can be done. Please see Tools section below for details.
Word of advice: even if units are consistent across the data set, be careful about unit interpretation. For example, // find article Fukushima: m sievert/h vs. not per hour
Ambiguous data
- due to the use of abbreviations (Dr. for doctor or drive)
- or incomplete context (homonyms; Miami, of Ohio or of Florida)
Unit inconsistency
- measurement units (including date, time, currency, distance, weight, area, volume...)
- solution: algorithmic transformation is possible
Data Smell #15: Encoding Problems
- My master's project!
Overall lack of consistency in representation. So many cases to consider:
- including negative number, currency, date, time, precision, fraction
- abbreviations (ste for suite, hwy for highway)
- Alias/nick name (e.g., Mopac, Loop 1, and Highway 1; Bill Clinton, President Clinton, William Jefferson Clinton), Abbreviated version (e.g. John Kennedy for John Fitzgerald Kennedy)
- Uses of special characters (space, no space, dash, parenthesis, in a social security number or phone number)
- Different orderings (John Kennedy vs. Kennedy, John)
- hierarchical data: state-county-city vs. state-city, different orderings (city-state, state-city)
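Many of these representation inconsistencies can be tamed with regular expressions. The sketch below normalizes hypothetical phone numbers recorded with inconsistent special characters by keeping the digits only:

```python
import re

# Hypothetical phone numbers: same number, four different representations.
phones = ["(212) 555-0123", "212-555-0123", "212 555 0123", "2125550123"]

# Strip every non-digit character so all variants collapse to one form.
normalized = {re.sub(r"\D", "", p) for p in phones}
print(normalized)  # {'2125550123'}
```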
// Simplify: this refers to errors introduced when data sets are integrated (for example, problematic joins) or when representation standards are not consistent across data entries.
// explain what a join is (tools section) // improper join example: race and ethnicity?
###Data Smells in Fabricated Data
Fabricated data makes for juicy data journalism: if a data journalist can demonstrate that data was fabricated, she has a powerful story in her hands. The task is a difficult one, but there are a few techniques one can attempt.
Data Smell #16: Benford's Law Doesn't Hold
Benford's law, also known as the law of leading digits, dictates that the frequency distribution of leading digits follows a logarithmic distribution: the probability of leading digit d is log10(1 + 1/d). As a result, about 30% of data entries in a data set should have a leading digit of '1', while '9' should be the leading digit less than 5% of the time.
Benford's law usually applies in scenarios in which quantities span several orders of magnitude (say, from thousands to billions) and arise out of natural processes (for example, land area or population). You can watch Benford's law at play on a variety of data sets here.
Now, it can't be stressed enough that a failure of Benford's Law to hold does not necessarily mean that the data is fabricated. Necessary and sufficient conditions for Benford's Law to apply are yet to be discovered, which makes its use in journalism a delicate matter.
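For reference, here is a small Python sketch of the distribution Benford's law predicts, together with a helper that computes the observed leading-digit frequencies of a data set (the sample values are hypothetical):

```python
import math
from collections import Counter

def benford_expected():
    """Expected frequency of each leading digit: P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit_freq(values):
    """Observed frequency of leading digits 1-9 in a list of numbers."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

# Digit 1 should lead about 30% of the time under Benford's law.
print(round(benford_expected()[1], 3))  # 0.301
```

Comparing the two distributions (for example, with a chi-square test) gives a rough sense of whether the data deviates from Benford's law, keeping in mind the caveats above.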
Example: Toxics Release Data by James Hamilton. http://www.poynter.org/latest-news/als-morning-meeting/98079/why-journalists-should-be-cautious-when-intrepreting-new-toxics-release-data/ http://www.nieman.harvard.edu/reports/article/100933/Tracking-Toxics-When-the-Data-Are-Polluted.aspx // comparative analysis
// Need more on when you would and would not expect it to hold? when should journalist try it?
Example: A few journalists have fallen into the trap of claiming a data set was fabricated based solely on the fact that Benford's Law doesn't apply. You can read one such story here.
A project at DataFest, promoted by The Brown Institute For Media Innovation at Columbia University, has attempted to investigate government accounting fraud through Benford's Law but its results were inconclusive.
You can use this online tool put together by Journalism++ to test whether or not a data set abides by Benford's Law.
Data Smell #17: Lack of Accuracy
Wikipedia's entry on data cleansing (http://en.wikipedia.org/wiki/Data_cleansing) describes accuracy as follows:
> "Accuracy: The degree of conformity of a measure to a standard or a true value - see also Accuracy and precision. Accuracy is very hard to achieve through data-cleansing in the general case, because it requires accessing an external source of data that contains the true value: such 'gold standard' data is often unavailable. Accuracy has been achieved in some cleansing contexts, notably customer contact data, by using external databases that match up zip codes to geographical locations (city and state), and also help verify that street addresses within these zip codes actually exist."
###Tools
####1. Frequency Table of Categorical Variables in Excel or Google Refine
// link to specific google refine tutorial // histograms equivalent to freq tables for num data // freq tables in R: factor()
####2. Joins