5 More Tips for Improving Data Quality

1. Consider how you use nulls (0 is different to null)

Suppose you are measuring rainfall each day but you go away for the weekend. On Saturday and Sunday there may have been some rain but you have no way of knowing. One way to record this is to put a zero against the rainfall, however, this would be wrong because we have no way of knowing that the rainfall was actually zero. Another way may be to leave the value blank. The problem with this is that we don’t know if this is missing data or has deliberately been left blank. The standard way to do this is to choose a particular value that represents a null value. This value should obviously be one that cannot reasonably be mistaken for a real measurement. For rainfall, a value of -9999 may be suitable. When using null values, it is especially important to ensure that the value representing null is well defined. The best place for that definition is in the data dictionary.

2. Beware of duplicates

There are many ways in which we may get duplicates in our data. There may be human error in entering results or measurements. We may have collected our data from different sources and there may be overlaps in the data.

Sometimes duplicates may be easy to identify (and programming tools can be of great assistance) and other duplicates may be more difficult to find. Suppose, for example, that we have a dataset that consists of place names. Often there are small variations in the spelling of place names. Street addresses are particularly hard. Is it Main St., or Main Street?

3. Consider how to represent spatial data

There are a number of ways of representing spatial data (in particular, points on the surface of the earth). One of the most common ways is by referencing the latitude and longitude of the point. It is a common misconception that a latitude/longitude uniquely references a single point on the surface of the earth. I won’t go into a lot of detail here, but when we use a latitude/longitude, we may be assuming the earth is a perfect sphere which it is not. Because it’s not perfect, there are a number of different assumptions we can make about the shape of the earth. These are catalogued into what are called Spatial Reference Systems (SRS). So, when using a latitude/longitude in our data, we should also be careful to say which SRS we are assuming. A very common one is EPSG:4326, also known as WGS84.

Another thing about latitudes and longitudes is that there are two common ways of expressing them as numbers. One method is to use a decimal and the other is to use minutes and seconds. I would suggest always using a decimal format. It is more common in the scientific community, allows arbitrary accuracy and makes calculations simpler.

Finally, you have to decide how to represent positions south of the equator. The latitude of Melbourne, Australia, for example, could be represented as either 37.8136 S, or as - 37.8136. I would suggest the latter, as once again this makes calculations simpler. What is more important, though, is that once you have chosen a representation, make it clear and be consistent.

4. Sanity check your data (and document the sanity checks)

A sanity check is looking for data that is so far out of range there is an obvious mistake. A sanity check for rainfall may look for negative values. A sanity check for number of siblings also could not be negative, and a value of 30 may be considered a reasonable uppper bound.

5. Automate all your quality checks

If a dataset consists of a few dozen rows, it may be feasible to check each row by hand. But as soon as it grows to more than a hundred or so rows, then this is no longer feasible, especially if it is frequently changing. With some very simple programming, it is possible to create automated quality checks (this will be the topic of an upcoming blog post).

Over time more quality issues may be discovered in the dataset. It is important that as soon as these issues are discovered, the automated checks are updated so that the new errors will be discovered.

Finally, it is good practice to save each version of the quality checking code and reference it together with the dataset so that anybody using the dataset can be confident of quality of the data.