Ensuring Accuracy in Data Journalism
#Data Smells
By Nikolas Iubel
It is often said of data journalists that they interview data just like traditional reporters interview traditional sources. A corollary is that data must be subject to the same scrutiny traditional sources are. Not all sources can be readily trusted and neither can all data sets. But how do data journalists go about determining whether or not a data set demands closer inspection?
Inspired by the idea of code smells, we take on the task of developing a list of "data smells", that is, red flags in data sets that may indicate the data is not reliable for journalistic purposes. The term "data smells" was suggested by The New York Times' Jacob Harris at the 2014 NICAR/IRE conference in Baltimore. This is not by any means an exhaustive list; feedback and contributions are welcome.
###Data Smells in Dirty Data
When data is unreliable or untrustworthy, we say data is dirty. Now, what exactly makes data dirty?
"Broadly, dirty data include missing data, wrong data, and non-standard representations of the same data. We say that data is dirty if the user or application ends up with a wrong result or is not able to derive a result due to certain inherent problems with the data." (Kim et al., 2001)
At this point, we make a distinction between two types of wrong data: unintentionally inaccurate data and doctored or fabricated data. For now, we will take "wrong" in the context of dirty data to mean "unintentionally inaccurate". We treat fabricated data as a separate case, which we will look at later. It is trickier to detect fabricated data than dirty data.
Kim et al. published a taxonomy of dirty data in the journal Data Mining and Knowledge Discovery. In order to study dirty data in the context of journalism, we'll adopt their taxonomy and use it as a baseline to list different types of data smells, providing journalistic examples to illustrate each type. For the purposes of this project, we will focus on numerical and string data (that is, we exclude multimedia data such as audio and video, because these require further considerations).
Based on Kim's taxonomy, we divide data smells into three broad categories:
- missing data
- wrong data
- unusable data
We'll look at each case closely, but before we do so, it is worth going over a concept that is fundamental to our discussion: the frequency table.
A frequency table of a given variable is simply a table that shows how many times each value of that variable occurs in the data. Sometimes it also displays the percentage frequency of each value.
For example, consider the following data set of FIFA World Cup editions and hosts:
| YEAR | HOST |
|---|---|
| 1930 | Uruguay |
| 1934 | Italy |
| 1938 | France |
| 1950 | Brazil |
| 1954 | Switzerland |
| 1958 | Sweden |
| 1962 | Chile |
| 1966 | England |
| 1970 | Mexico |
| 1974 | West Germany |
| 1978 | Argentina |
| 1982 | Spain |
| 1986 | Mexico |
| 1990 | Italy |
| 1994 | United States |
| 1998 | France |
| 2002 | South Korea and Japan |
| 2006 | Germany |
| 2010 | South Africa |
| 2014 | Brazil |
| 2018 | Russia |
| 2022 | Qatar |
A frequency table of the variable 'Host' simply shows a count of how many times each country listed has hosted the FIFA World Cup plus the percentage frequency of each country as host. The frequency table is as follows:
| HOST | Frequency | Frequency(%) |
|---|---|---|
| Argentina | 1 | 4.545% |
| Brazil | 2 | 9.091% |
| Chile | 1 | 4.545% |
| England | 1 | 4.545% |
| France | 2 | 9.091% |
| Germany | 1 | 4.545% |
| Italy | 2 | 9.091% |
| Mexico | 2 | 9.091% |
| Qatar | 1 | 4.545% |
| Russia | 1 | 4.545% |
| South Africa | 1 | 4.545% |
| South Korea and Japan | 1 | 4.545% |
| Spain | 1 | 4.545% |
| Sweden | 1 | 4.545% |
| Switzerland | 1 | 4.545% |
| United States | 1 | 4.545% |
| Uruguay | 1 | 4.545% |
| West Germany | 1 | 4.545% |
Frequency tables can be built in Excel or Google Refine. For more on frequency tables and how to construct them, please see the Tutorials section below.
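For readers comfortable with a scripting language, the same frequency table can be computed in a few lines of Python; a minimal sketch using the World Cup hosts above:

```python
from collections import Counter

# The 'Host' column from the World Cup table above
hosts = [
    "Uruguay", "Italy", "France", "Brazil", "Switzerland", "Sweden",
    "Chile", "England", "Mexico", "West Germany", "Argentina", "Spain",
    "Mexico", "Italy", "United States", "France", "South Korea and Japan",
    "Germany", "South Africa", "Brazil", "Russia", "Qatar",
]

counts = Counter(hosts)
total = len(hosts)

# Print each host with its absolute and percentage frequency
for host, n in sorted(counts.items()):
    print(f"{host}: {n} ({100 * n / total:.3f}%)")
```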
We now move on to examine the three broad categories of data smells: missing data, wrong data and unusable data.
####1. Missing Data
Oftentimes data sets contain missing entries. This might be because the specific data point is unknown, for instance. Now, if missing data is allowed, all is fine, but if missing data is not allowed or not expected, the data is considered to be dirty.
Data Smell #1: Blank Cells in a Spreadsheet
The simplest case of missing data is the mere presence of blank cells in a spreadsheet. For instance, imagine a data set of campaign donations in which a given entry contains the donation amount but not the name (or some identifier) of the donor.
The fact that a data set is incomplete when it shouldn't be will often become a story in itself, such as the following story published in the UK: London councils lack data on children missing from care, finds report. London boroughs are legally responsible for the care and protection of a number of children, but some boroughs lack data on the number of children missing from care. The incomplete data set is available here.
For a small data set, detecting the presence of blank cells is as simple as opening the document and visually inspecting it in search of blank cells. The larger the data set, the more difficult it is to visually detect blank cells. Fortunately, frequency tables come to the rescue: this task can be automated in Excel or Google Refine. One simply needs to check the frequency of 'blank' values. If 'blank' occurs at all in the data set, it may be worth investigating why that is. The question of whether or not data is allowed or expected to be missing often comes down to a judgement call and may require further reporting to be answered.
Note that there is a distinction between 'blank' and 'empty' values. When checking for blank entries, be careful with whitespace and tab characters, which look 'blank' but are not 'empty'. The data set at hand might contain entries that look blank but are actually filled with whitespace and tab characters. This scenario might occur when the protocol for manually entering data into a system does not allow for missing values. When faced with missing data, the human inputting data into the system enters an invisible character such as tab or whitespace to bypass this constraint. No search function in Excel, Google Refine (or really any programming language) will classify these entries as empty because they actually contain a character (albeit an invisible one). In short, make sure to look at the frequency of whitespaces and tab characters as well as the frequency of 'empty' cells.
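One way to guard against this pitfall in code is to strip whitespace before testing for emptiness. A minimal Python sketch (the sample cells are made up):

```python
def is_effectively_blank(cell):
    """True for empty cells and for cells holding only invisible characters."""
    return cell.strip() == ""

cells = ["Uruguay", "", "   ", "\t", "Italy"]

# A naive equality check only catches truly empty strings...
naive_blanks = sum(1 for c in cells if c == "")
# ...while stripping whitespace first also catches whitespace-filled cells.
real_blanks = sum(1 for c in cells if is_effectively_blank(c))

print(naive_blanks, real_blanks)  # 1 vs 3
```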
Data Smell #2: Gaps in Data
A slightly more evolved case of missing data is when the data set does not contain 'blank' entries but data is missing anyway, i.e., there is a gap in the data.
For example, a data set of illiteracy rates across the U.S. might contain data for only 49 states when there should be data for all 50 states. If the 'states' column lists all 50 states but the 'rates' column contains only 49 values, then this is data smell #1: an instance of blank cells in a spreadsheet. However, if the 'states' and the 'rates' columns both contain 49 values, it wouldn't be evident from the analysis of the frequency table that data is missing.
Back to the FIFA World Cup example, note that the table showing the host country for each edition has an entry corresponding to every four years since the World Cup's first edition in 1930, except for the years 1942 and 1946. This is an example of a gap in data. Further investigation shows that the FIFA World Cup did not take place in 1942 and 1946 due to World War II. This example shows that a gap in data is not necessarily a mistake, but it is something that should nonetheless always be looked into.
Besides constructing a frequency table, one can detect this smell by comparing the data set against another data set of the same scope ('national', in the example above) which is known to be complete. This can be done by joining the two data sets.
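Sketched in Python, the comparison boils down to a set difference between the reference list and the data set's keys (the state names and rates below are hypothetical):

```python
# Hypothetical illiteracy rates covering only part of the country
rates = {"Alabama": 14.2, "Alaska": 8.9, "Arizona": 12.5}

# Reference list known to be complete (truncated here for brevity)
all_states = {"Alabama", "Alaska", "Arizona", "Arkansas"}

missing = all_states - set(rates)
print(sorted(missing))  # ['Arkansas']
```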
Time series plots are particularly useful for detecting gaps in temporal data. The following story by the Wall Street Journal is based on a data set that contains gaps: Five Revelations From the U.S. Embassy’s Beijing Pollution Data Dump. There's a footnote on the graph noting that no data is available for Nov. 7, 2008 - Feb. 16, 2009. Note how the lack of data for these dates stands out in the time series plot.
=========================
It might be the case that the remainder of the data set is accurate, in which case we don't want a few instances of missing data to keep us from telling a story. There are a couple of possible ways to handle the lack of information for a given entry:
- Simply ignore incomplete rows.
- Perform interpolation to approximate the value of the missing data.
- Call up the institution responsible for the data set and ask for the missing information or the reason why it's missing.
Be aware that these might not apply in a given situation. In the case of data on missing children in London, for example, it wouldn't make sense to approximate data for one neighborhood by interpolating data from others, since an exact count of children is the goal.
It did make sense to interpolate data in the following story published by The Guardian's Data Blog: Inequality 'worst since second world war'. In this case, missing data on the income share by portion of the population were estimated by interpolating neighboring values, because these figures rarely undergo sharp variations from one year to the next.
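For a single missing yearly value in a slow-moving series, the interpolation can be as simple as averaging the neighboring years. A minimal Python sketch with made-up income-share figures:

```python
def interpolate_gap(before, after):
    """Estimate one missing yearly value as the mean of its two neighbors."""
    return (before + after) / 2

# Hypothetical income-share series with the 2010 value missing
income_share = {2009: 14.1, 2010: None, 2011: 14.5}
income_share[2010] = interpolate_gap(income_share[2009], income_share[2011])
print(income_share[2010])  # approximately 14.3
```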
####2. Wrong Data
The following is a fairly comprehensive list of instances where data might be wrong. As a reminder, we take wrong data to mean unintentionally inaccurate (as opposed to doctored).
Data Smell #3: Wrong Type
A column might contain values of one data type when a different type is expected. For example, the 'age' value for some data entry might be a string (such as 'abc') when the values are expected to be integers (such as '12' or '3').
In broad terms, data types are either of type numeric (such as integers) or type string, as The Tow Center For Digital Journalism explains in a handy blog post on Code Vocabulary. Poynter has a nice tutorial on how journalists can use Excel to organize data which revolves around this idea of not mixing up different data types.
Frequency tables will readily pick out data type inconsistencies; all you need to do is search the table for values that are not of the expected type. In our FIFA example data set, for instance, the two rows below would fit the bill. In the first row, attribute 'year' has a value of type string but a numeric value is expected. In the second row, attribute 'host' has a value of type numeric but a string value is expected.
| YEAR | HOST |
|---|---|
| ABCD | Uruguay |
| 1934 | 99999 |
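A quick programmatic version of this check, sketched in Python against the two bad rows above:

```python
def parses_as_int(value):
    """A 'year' cell should be readable as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False

rows = [("ABCD", "Uruguay"), ("1934", "99999")]

bad_years = [r for r in rows if not parses_as_int(r[0])]
bad_hosts = [r for r in rows if r[1].isdigit()]  # a purely numeric host name is suspect

print(bad_years)  # [('ABCD', 'Uruguay')]
print(bad_hosts)  # [('1934', '99999')]
```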
Data Smell #4: Outliers
Outliers are data points that significantly deviate in some way from the expected or observed trend of the entire data set. The data smell related to outliers refers to the case when data values fall outside the expected range. An example would be a data point with a value of '225' for a person's age, which one might expect to range from '0' to about '115'. Such values are usually due to errors in the data entry process. An example of an outlier for string data would be a value of 'D' for a person's gender, when 'M' or 'F' are the expected values.
Outliers occur due to either an error in data collection or an interesting, perhaps newsworthy event. For instance, '225' in the example above can very well result from a misspelling of '25'. Likewise, data value 'D' for gender could be a typing mistake, seeing as letters 'D' and 'F' are side by side on the keyboard. An example of an interesting outlier event is this story by The Chicago Tribune, which shows that a small group of healthcare professionals billed the federal government at the highest rate for more than 70% of their Medicare visits, deviating significantly from the overall trend.
A special case of outliers takes place when entry values are entered in the wrong field. For example, one might see a person's address listed in the 'Phone' column when it should be listed under 'Address'. Because the misplaced value does not belong in that column, it deviates significantly from all the other values in the column, which makes it an outlier.
Specifying the expected range of data values is often a difficult task which requires further reporting. As a result, it is often difficult to determine whether outliers are erroneous data. Possible remedies are ignoring entries whose values fall outside the expected range or contacting the publisher of the data set to ask for clarification and/or correction.
Whether outliers stem from errors or interesting events, they merit close inspection. If there is an error in data, it is paramount to remedy the error before going further with the data analysis. On the other hand, if the data point has been accurately measured, its significant deviation from the overall data trend will often be a story worth reporting on.
Unsurprisingly, identifying outliers and reporting on their causes or consequences is a classic approach to data journalism. An outstanding piece of journalism that resulted from the investigation of outliers in a data set is this story by The Palm Beach Post's Lawrence Mower, which uncovered a lottery fraud scheme. Mower analyzed a data set of Florida's lottery winners from 1997 to 2013 and observed that some store clerks and owners had cashed in hundreds or even thousands of high-value winning tickets, while the vast majority of winners had claimed a prize only once. The reporter eventually found out that store clerks and owners would tell winning customers that they had not won and would then cash in the prizes on their behalf. This is a great example of data set outliers leading to the uncovering of serious wrongdoing. The Columbia Journalism Review published an article dissecting Mower's reporting process.
Frequency tables are again useful to detect outliers because they list all values that occur in a given column and as a side bonus, provide the range of values for each column. One might detect these smells by examining simple statistics such as maximum, minimum, quartiles and percentiles for each column. Please see the Tutorials section below for more on how to use these functions in Excel.
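Outside of spreadsheets, the same idea can be automated with quartiles and Tukey's 1.5 × IQR rule. A minimal Python sketch with a made-up age column containing the '225' outlier discussed above:

```python
import statistics

ages = [23, 25, 27, 31, 34, 36, 41, 44, 47, 225]  # hypothetical column; 225 smells

q1, _, q3 = statistics.quantiles(ages, n=4)  # quartiles of the column
iqr = q3 - q1

# Tukey's rule: flag any point beyond 1.5 * IQR from the nearest quartile
outliers = [a for a in ages if a < q1 - 1.5 * iqr or a > q3 + 1.5 * iqr]
print(outliers)  # [225]
```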
Data visualizations such as histograms or scatter plots will also allow you to detect outliers. The graph below plots gun ownership (x-axis) versus homicide rate (y-axis) by country. The data was published in 2012 by The Guardian Data Blog. Note how Honduras and the United States are clear outliers in the plot: the former has a much higher homicide rate than all other countries, while the latter has a much higher gun ownership rate than all other countries.

Data Smell #5: Sharp Curves
Abrupt shifts in temporal or spatial data are usually an interesting finding which calls for close examination. Just like outliers, these shifts might either be a story worth reporting on or the result of some error in data collection. Either way, they are worth investigating. The best way to spot sharp curves in temporal data is by constructing a slope graph (a form popularized by Edward Tufte) or a time series plot. In the case of spatial data, color-coded maps are the way to go.
This article by The Wire on the 2014 press freedom ranking by country provides examples of both cases. In terms of temporal data, the sharp drop for the U.S., which fell 14 spots from 2013 to 2014, did not go unnoticed upon the release of the 2014 ranking. By comparing the rankings over time (i.e., in a time series), the reporter was able to spot the U.S.'s sharp fall in the ranking.
As for spatial data, note how the color-coded map illustrating the article makes it easy to see that Namibia stands out positively among its neighboring countries while Sudan stands out negatively in Africa. The contrast between the color codes is a visual representation of the sharp rises and drops between the indexes of neighboring countries.
===========================================
The following smells are internal consistency checks. Internal consistency is the idea that no two data points within a data set nor two data sets within a database contradict each other. The smells below refer to instances of violation of this internal consistency standard.
Data Smell #6: Conflicting Information Within a Data Set
This data smell arises when cross-field validation is not observed. Cross-field validation refers to the idea that conditions based on multiple data points must hold. For example, one such condition in the FIFA data set example is that all percentage frequencies in the frequency table must add up to 100%. Another example would be the idea that the start date of some event must be earlier than the end date of the same event. Any end date earlier than the start date for the same event would raise a red flag.
A common instance of conflicting information within data sets occurs when there are repeated values for an attribute which expects unique values only. For example, social security numbers (SSN) are expected to be unique and so the occurrence of repeated SSN's in a data set would be a data smell. NBC News published a story on the data analysis of SSN records done by an analytics company which shows that as many as 40 million SSN's are attached to more than one person in the U.S.
The difficulty in checking for this smell is coming up with a list of all cross-field validation conditions which are expected to be met.
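Once that list exists, each condition is usually easy to express in code. A minimal Python sketch of the two conditions mentioned above, plus a uniqueness check of the kind that would catch repeated SSNs:

```python
import math
from collections import Counter
from datetime import date

def frequencies_ok(percentages):
    """Cross-field check: percentage frequencies should sum to ~100%."""
    return math.isclose(sum(percentages), 100.0, abs_tol=0.1)

def dates_ok(start, end):
    """Cross-field check: an event cannot end before it starts."""
    return start <= end

def duplicates(values):
    """Uniqueness check: values that occur more than once in a unique-only column."""
    return sorted(v for v, n in Counter(values).items() if n > 1)

print(frequencies_ok([45.0, 30.0, 25.0]))              # True
print(dates_ok(date(2014, 7, 13), date(2014, 6, 12)))  # False: red flag
print(duplicates(["111-22-3333", "444-55-6666", "111-22-3333"]))
```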
Data Smell #7: Conflicting Information Across Multiple Data Sets
Much of data journalism is about comparing data sets and drawing interesting conclusions from the comparison. However, it might be the case that two data sets contain contradictory information, in which case the reporter must closely examine the contradiction.
For example, a reporter might find that a company's total number of employees (listed, say, in a directory of employees) is lower than the sum of the number of employees by department. Perhaps this is because some employees work in multiple departments, the reporter might find after some investigation. Perhaps there is an error in some data entry. Either way, this is a red flag warning the reporter about the need for further investigation before the actual data analysis.
The Tampa Bay Times' Matt Waite wrote an informative piece titled Handling Data About Race and Ethnicity on Source describing his misfortune in ignoring this data smell. Waite explains that 'Hispanic' was recorded as a race in Florida's voter registration database and as an 'ethnicity' in the state's criminal history database, which meant that there were almost no matches between the two databases when Hispanic (as a race) was used as a criterion for matching, even though both databases contained many entries with the value 'Hispanic'. In short, the two datasets provided conflicting information.
A special case of this data smell is a dangling reference across data sets, which takes place when data in one data set has no counterpart in another. For example, 'Jonathan Stray' might be listed as instructor for the 'Computational Journalism' course in an instructor/course table and at the same time not be registered as an 'instructor' (but as a 'fellow') in a table of the university's affiliates.
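Dangling references can be found mechanically by checking every key in one table against the other. A minimal Python sketch (the second affiliate is invented for illustration):

```python
instructor_courses = {"Jonathan Stray": "Computational Journalism"}
affiliates = {"Jonathan Stray": "fellow", "Jane Doe": "instructor"}  # hypothetical

# Anyone teaching a course should be registered as an 'instructor'
dangling = [
    name for name in instructor_courses
    if affiliates.get(name) != "instructor"
]
print(dangling)  # ['Jonathan Stray']
```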
There is no straightforward way to handle these smells. Possible strategies are to try to decide which data point is more reliable (by picking more recent over older data, for example) or to fact check the conflicting data points (by calling up the institution responsible for publishing the data, for example.)
Data Smell #8: Wrong Derived-Data Field
Once in possession of a trusted data set, a natural first step for data journalists is to start manipulating it, which often involves creating new columns based on the original ones. Frequency tables are a good example of data that can be derived from a dataset. However, this process of deriving new data from the original dataset might introduce errors not initially present in the data.
For instance, we can expand our initial FIFA data set by adding a third column indicating whether the host country was hosting the event for the first or the second time. Note that this information does not require external knowledge; rather, it can be derived from the original data set. An erroneous version of the expanded data set is the following:
| YEAR | HOST | 1st or 2nd TIME HOSTING |
|---|---|---|
| 1930 | Uruguay | 1 |
| 1934 | Italy | 1 |
| 1938 | France | 1 |
| 1950 | Brazil | 1 |
| 1954 | Switzerland | 1 |
| 1958 | Sweden | 1 |
| 1962 | Chile | 1 |
| 1966 | England | 1 |
| 1970 | Mexico | 1 |
| 1974 | West Germany | 1 |
| 1978 | Argentina | 1 |
| 1982 | Spain | 1 |
| 1986 | Mexico | 2 |
| 1990 | Italy | 2 |
| 1994 | United States | 1 |
| 1998 | France | 2 |
| 2002 | South Korea and Japan | 1 |
| 2006 | Germany | 1 |
| 2010 | South Africa | 1 |
| 2014 | Brazil | 1 |
| 2018 | Russia | 1 |
| 2022 | Qatar | 1 |
Note that the third column in the row corresponding to '2014' says that Brazil is hosting the World Cup for the first time in 2014. This is not true though, since Brazil also hosted the World Cup in 1950. By expanding on the initial data set, we have introduced an error not originally present in the data. This is something data journalists must always watch out for.
Retraction Watch posted an article describing how two influential Harvard economists made this very mistake. When analyzing data on public debt and GDP among 20 advanced economies, the scholars mistakenly excluded from their analysis a handful of critical data points present in the original data set.
In order to detect wrong derived-data values, it is worth revisiting the original data set and recalculating the derived values.
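The recalculation can itself be automated. A minimal Python sketch that rebuilds the 'nth time hosting' column straight from the host list and exposes the error in the 2014 row:

```python
hosts = [
    "Uruguay", "Italy", "France", "Brazil", "Switzerland", "Sweden",
    "Chile", "England", "Mexico", "West Germany", "Argentina", "Spain",
    "Mexico", "Italy", "United States", "France", "South Korea and Japan",
    "Germany", "South Africa", "Brazil", "Russia", "Qatar",
]

# Recompute how many times each country had hosted up to and including each row
seen = {}
nth_time = []
for host in hosts:
    seen[host] = seen.get(host, 0) + 1
    nth_time.append(seen[host])

claimed_2014 = 1  # what the erroneous derived column says for Brazil in 2014
recomputed_2014 = nth_time[hosts.index("Brazil", 4)]  # Brazil's second appearance
print(claimed_2014, recomputed_2014)  # 1 2  (the derived column is wrong)
```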
The process of converting data sets into a different format is particularly conducive to introducing errors not originally present in the data set. One example of such errors are character encoding issues, which consist of undesired changes in character representation across the data set.
==============================================
We now move on to external consistency checks. External consistency is, simply put, the idea that data measures need to make sense in the real world. Given the context surrounding the data, the measured values need to be reasonable when compared against values outside the data set.
Data Smell #9: External Inconsistency
This is the case when data doesn't hold up, that is, it doesn't pass a reality check. Reality check in this context means comparing the data set against values outside of it. For example, a journalist might find that the fishing statistics report for a given country is problematic because the amount of fish claimed to have been caught adds up to more than the total number of fish in the ocean.
Matt Waite's article Handling Data About Race and Ethnicity describes how The Sarasota Herald-Tribune used this data smell to show that a felon purge list released by the state of Florida was flawed. In essence, the Herald-Tribune noticed that only 61 out of 47,000 felons were listed as Hispanic while 17% of Florida's population claims Hispanic as their race. The data simply didn't hold up; it didn't make sense. This smell led the Herald-Tribune to investigate further and eventually determine that Hispanic was classified as a race in the state's voter registration database and as an ethnicity in the state's criminal history database.
Another good story that came out of external consistency checks for data is The Los Angeles Times' LAPD's public database omits nearly 40% of this year's crimes. The paper discovered that the Los Angeles Police Department's online crime map was missing nearly 40% of reported crimes by comparing the online crime map against the department's official crime tally.
Data Smell #10: Wrong Spatial Data
This is a domain-specific instance of external inconsistency that comes up when we're dealing with spatial data, in particular shapefiles. Shapefiles are data sets that specify geographical coordinates (e.g., state borders in the U.S.) based on points, lines and polygons. These files are the backbone of Geographic Information Systems (GIS) tools but can also be used with statistical packages popular among data journalists such as R.
One needs to watch out for the case when shapefiles are incomplete or inaccurate. For example, a journalist who would like to map crime rates in all precincts across New York City might find that the shapefile at hand plots the borough of Manhattan only.
When working with shapefiles, it is good practice to visually check the geographical integrity of the data. To do so, plot the shapefile over the map output by a 'golden standard' shapefile and check whether the shapes line up properly.
====================================================
Data Smell #11: Unrepeatable Read
We now look at a data smell that is particular to dynamic data, that is, data that is updated over time. Unrepeatable read refers to the impossibility of re-reading a data point if the point is deleted between reading attempts. For example, a journalist embedding tweets on Storify needs to consider the possibility that the user might delete that tweet. Whenever someone clicks on the deleted tweet's link to the original tweet, Twitter will say the page no longer exists. However, the tweet will remain in the Storify page after its deletion; the journalist needs to make a judgement call on whether or not this might be a problem.
This smell is not a concern when only static data is used, that is, when there is no need to handle updates to the data set. In the context of journalism, these smells might arise in the development of interactive interfaces that rely on frequent calls to APIs.
The Panda Project, an online platform that allows for easy centralization and version control of data sets, is a useful tool to manage frequently updated data.
####3. Unusable Data
Unusable data is data that is not wrong but is nevertheless unusable due to lack of consistency in representations of data points across the data set.
Data Smell #12: Non-Standardized Data
It is paramount that data points in a data set are recorded consistently, that is, follow the same pattern or standard. In the Data Journalism Handbook, Steve Doig elegantly illustrates this smell:
"Take “Gender”, an easy example. You may discover that your Gender field includes any of a mix of values like these: Male, Female, M, F, 1, 0, MALE, FEMALE, etc., including misspellings like ‘Femal’. To do a proper gender analysis, you must standardise — decide on M and F, perhaps — and then change all the variations to match the standards. Another common database with these kinds of problems are American campaign finance records, where the Occupation field might list “Lawyer”, “Attorney”, “Atty”, “Counsel”, “Trial Lawyer” and any of a wealth of variations and misspellings; again, the trick is to standardise the occupation titles into a shorter list of possibilities."
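In code, this standardization boils down to a lookup table mapping every observed variant (including known misspellings) onto the chosen standard. A minimal Python sketch of Doig's gender example:

```python
# Map every variant seen in the data onto the chosen standard values
GENDER_MAP = {
    "male": "M", "m": "M", "1": "M",
    "female": "F", "f": "F", "0": "F",
    "femal": "F",  # known misspelling
}

def standardize_gender(raw):
    return GENDER_MAP.get(raw.strip().lower(), "UNKNOWN")

values = ["Male", "FEMALE", "M", "f", "1", "Femal"]
print([standardize_gender(v) for v in values])  # ['M', 'F', 'M', 'F', 'M', 'F']
```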
The causes of inconsistency in data sets are varied. The most common are the following:
- Abbreviations
Mixing up abbreviations (e.g., ste for suite, hwy for highway) and spelled-out words can be a source of inconsistency in data sets. The same applies to abbreviated names. For example, a data set shouldn't mix full names (e.g., 'John Fitzgerald Kennedy') and abbreviated names (e.g., 'John Kennedy'). A standard should have been set in favor of either full or abbreviated names across all data points when the data set was created. If that wasn't the case, then the journalist must perform such standardization before proceeding to data analysis.
- Aliases
Aliases can also be a source of confusion. For example, 'Bill Clinton', 'President Clinton' and 'William Jefferson Clinton' are different strings that refer to the same person; 'Mopac', 'Loop 1' and 'Highway 1' all refer to the same road. By the same token, different orderings of the same name (e.g., 'Barack Obama' and 'Obama, Barack') may lead to misinterpretation. Again, it is important to ensure a unique standard is used across the entire data set.
- Ambiguous Data
Data points can be ambiguous if the context surrounding them is not completely known. For example, the abbreviation 'Dr.' is used both for the words doctor and drive; context will inform which interpretation is appropriate. Likewise, a value of 'Paris' for the attribute 'city' can be ambiguous if we have no way of knowing whether this is a reference to Paris, Texas or Paris, France.
- Extraneous Data
Extraneous data are details such as titles, name prefixes and suffixes which may cause inconsistency when added to entries. For example, mixing the use of name and title (such as CJ. John Roberts Jr.) with the use of just the name (John Roberts Jr.) in a data set about Supreme Court justices makes it difficult to group all entries referring to the Chief Justice.
- Label Inconsistency
Labels should be consistent across data. There shouldn't be more than one possible label to describe a given entry. For example, when describing the types of products sold at a supermarket, the data set shouldn't describe one brand of pizza as "food" and another one as "frozen food".
Hierarchical data are particularly prone to this kind of inconsistency. For example, the very same location might be labeled as 'Boston, MA', 'Massachusetts', 'New England' or 'USA', depending on the level of granularity adopted. It is important to ensure the same level of granularity is consistently adopted across all points.
In the FIFA data set example, one could have added the annotation "(expected)" next to Brazil, Russia and Qatar in the rows corresponding to 2014, 2018 and 2022, since these events were yet to take place. However, this would make it more difficult to cluster the events of 1950 and 2014 as having been hosted by the same country - Brazil - since the labels used in each year would differ ("Brazil" and "Brazil (expected)", respectively).
If there is inconsistency in the labels used, it is definitely a good idea to standardize them before analyzing data. Google Refine is one tool with which this can be done. Please see the Tutorials section below for a link to details on how to use Google Refine.
- Special Characters
This refers to the lack of consistency in the use of special characters such as dashes, parentheses and spaces. For example, a phone number might be recorded as '(212) 555-9999' or '2125559999'. This is usually a problem with user-entered data in systems that do not enforce any particular format.
- Unit Inconsistency
Likewise, measurement units (including date, time, currency, distance, weight, area, volume, etc.) should be consistent across data. For example, a given data set shouldn't mix up metric (such as m and km) and imperial units (such as in and ft). It shouldn't even mix up different metric units such as km and m because different units undermine the task of comparing different data entries. Date representation is another frequent cause of inconsistency in data, as various formats can be adopted (e.g., Month/Day/Year, Day/Month/Year). Again, it is definitely a good idea to adopt a standard and enforce it across the entire data set in case different units are used.
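Several of the inconsistencies above can be normalized programmatically before analysis. For special characters, for instance, stripping every non-digit from a phone number makes differently formatted entries comparable; a minimal Python sketch:

```python
import re

def normalize_phone(raw):
    """Strip every non-digit so differently formatted numbers can be compared."""
    return re.sub(r"\D", "", raw)

print(normalize_phone("(212) 555-9999"))  # 2125559999
print(normalize_phone("(212) 555-9999") == normalize_phone("2125559999"))  # True
```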
================
Word of warning: Apart from the smells listed above, data sets might contain other problems that cannot be easily detected. Detection comes down to a judgement call about the contents of the data - a task that cannot be easily automated. Two examples are listed below:
Erroneous entry: If the value for 'age' is mistyped as 26 instead of the correct value 25, it is no easy task to detect this error in the data. The erroneous value looks correct, since it does not fall outside the expected range nor does it raise any other red flags that would alert us to the need for further inspection.
Misspelling: Misspellings such as 'principle' instead of 'principal' or 'effect' instead of 'affect' are somewhat difficult to detect because the incorrect spelling corresponds to an existing word in English. Context can be used to determine these errors but again, this task is not easily automated.
================
###Data Smells in Fabricated Data
Fabricated data provides for juicy data journalism. If a data journalist can demonstrate that data was fabricated, she has a powerful story in her hands. The task is a very difficult one because it ultimately means that an entire data set is unreliable (as opposed to just some entries, which was the scope of our smells so far). The smell below refers simply to a possible sign of fraud which calls for further inspection. In the journalism context, this means traditional investigative reporting.
Data Smell #13: Benford's Law Doesn't Hold
Benford's law, also known as the law of leading digits, dictates that the frequency distribution of leading digits follows a logarithmic distribution. As a result, about 30% of entries in a data set should have a leading digit of '1', while '9' should be the leading digit less than 5% of the time.
Benford's law usually applies in scenarios in which quantities span several orders of magnitude (for example, from thousands to billions) and arise out of natural processes (for example, land area or population). You can watch Benford's law at play on a variety of data sets here.
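Checking a column against Benford's law takes only a few lines. A minimal Python sketch (the amounts are invented; a real test would also need a goodness-of-fit measure such as chi-squared):

```python
import math
from collections import Counter

def leading_digit(n):
    """First digit of a positive integer."""
    return int(str(abs(n))[0])

def benford_expected(d):
    """Benford's law: P(leading digit = d) = log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)

# Hypothetical amounts spanning several orders of magnitude
amounts = [1234, 18700, 2050, 1100000, 940, 3333, 12, 176000, 1999, 45002]

observed = Counter(leading_digit(a) for a in amounts)
for d in range(1, 10):
    print(d, observed[d] / len(amounts), round(benford_expected(d), 3))
```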
Precise conditions under which Benford's law must apply are not yet known, which makes its use in journalism a delicate matter.
Now, it can't be stressed enough that the fact that Benford's Law doesn't hold does not necessarily mean that the data is fabricated. Necessary and sufficient conditions for Benford's Law to apply are yet to be discovered. Be careful not to fall into the trap of claiming a data set was fabricated based solely on the fact that Benford's Law doesn't apply.
There have been a few attempts to use Benford's law in journalism. For instance, James Hamilton's article Tracking Toxics When the Data Are Polluted describes how he compared two data sets about pollution levels by observing that Benford's law holds for independently measured data but not for self-reported data.
In a blog post titled Statistical Analysis and Visualization of the Drug War in Mexico, Diego Valle-Jones explains that the fact that homicide data provided by the statistics agency of the Mexican government does not follow Benford's law prompted him to investigate the data set further.
A project at DataFest, promoted by The Brown Institute For Media Innovation at Columbia University, has attempted to investigate government accounting fraud through Benford's Law but its results were inconclusive.
You can use this online tool put together by Journalism++ to test whether or not a data set abides by Benford's Law.
###Tutorials
- Frequency Tables By Hand
- Frequency Tables in Excel
- Frequency Tables in R
- Finding Max and Min in Excel
- Introduction to Google Refine