Data Quality for Researchers

I have spent a lot of time both producing and using datasets and in this article I will write what I think are the five things that will make the most difference in terms of improving the quality of a dataset. I am not making too many assumptions about the type of data, we could be counting dogs in a dog park or measuring the temperature of the ocean (or running a survey about peoples’ favourite flavour of ice cream).

Use a consistent format for dates

This is the one thing that is really easy to get right and really annoying when people don’t get it right. Is the date ‘06/04/18’ the fourth of June or the sixth of April or even the 18th of April, 2006. Or the 16th of April 1996? All these dates are possible. However, there is a convention (or standard) that removes all that ambiguity. It is called ISO 8601. For dates, we represent them as follows: YYYY-MM-DD where YYYY is the year, MM is the two-digit month and DD is the two digit day. So, Christmas day, 2016 would be written as: 2016-12-25.

ISO 8601 also allows us to specify a month of a particular year (YYYY-MM) and times.

You can read more about it in Wikipedia

If you’re not yet convinced, one of the big advantages is that it is really easy to read dates in ISO 8601 format using many different programming libraries.

Use a standard naming convention for items

Most domains have standard sets of names for identifying items.

So, suppose you’re doing a survey of birds seen in backyards in Melbourne, rather than recording ‘Yellow-tailed Black-Cockatoo’ or ‘Crested Pigeon’, it may be better to record those observations as ‘Calyptorhynchus funereus’ and ‘Ocyphaps lophotes’. This will avoid similar errors with names that look similar (e.g. Yellow-tailed Black-Cockatoo vs Yellow-tailed Black-Cuckatoo or Yellow tailed black cockatoo). Notice that it is also important to be consistent with capitalisation and hyphenation.

Many disciplines have what is called a Controlled vocabulary. This will give an even better level of consistency. You can read more about controlled vocabularies in Wikipedia.

If using a spreadsheet such as Excel, it is really good practice to use a dropdown list for item names. You should not allow users (you) to manually edit an item - if it is not in the dropdown list, it must be added to a separate list of valid items that can appear in the list

Create an identifier for each record that helps maintain uniqueness of each record.

Many types of observation will have a natural way of uniquely identifying each item. For example, suppose you’re maintaining a daily log of your weight, then the date would serve as a unique identifier. However, if there is no natural identifier, then it makes sense to create one. Having an identifier means that records will maintain their identity across time and across databases. Consider for example, if we’re running a trial about how labradors put on weight. We have a labrador called Rufus. After a few months Rufus drops out of the trial, but then later there is another Rufus who joins the trial. Without a unique identifier, we would not be able to distinguish between the two Rufuses. Also, as our dataset gets bigger, we may decide to start using a database rather than a file to store our data. Having a unique index will allow us to cross reference the entries in our database with the original data.

Don’t update the dataset. Make a fresh copy then update the copy and archive the original.

As data changes it is really important to be able to track those changes through time. There are specialised software systems that allow you to track this, but these all have a learning curve and for many purposes (e.g. a small dataset that does not change a lot over time) it is probably sufficient to just make backups. You can create a naming system such as dataset.csv.20180103 for each backup. You should also have a special directory where you keep your backups and a log. The log could consist of entries such as this:

20180103 Removed record 161 as it was a duplicate of record 66.
20180101 Removed record 33 as Rufus was not a labrador.

The first column, refers to the extension of the filename that we modified.

Prefer a text based format

While a spreadsheet such as Excel may be really useful for entering, editing and doing simple analysis on data, the format in which it saves files is not really portable across operating systems and time.

By saving files in a text format such as CSV, it allows the data to be used in far more environments. It also allows for easy editing of data using a text editor and other operating system tools.

(Bonus) Select good names for the column headers

Good names do not only make the data set appear better kept and more professional, they help reuse the data and prevent mistakes.

Suppose you’re looking at a dataset which consisted of results of interviews with pregnant women. There are columns named d1, d2, d3 and d4 which all contain numbers in date format. Obviously these columns are not very informative. A better scheme may be date_interview_first, date_interview_next, date_last_interview and date_interview_last. This is better and more descriptive. But is date_last_interview the most recent interview or is it the final interview? By changing the names to date_interview_final and date_interview_previous we make it more clear.

The second part of this is to maintain what is called a data dictionary. This describes what is held in each column and the values that it can take on. For really high quality data, we can go one step further and describe the constraints.

For example:

date_last_interview: This is the date in ISO 8601 format ('YYYY-MM-DD')
on which the last full interview was held.   If the interview was
postponed or cancelled the date will be left blank.

Another example:

sex: This will be one of M, F, N or U with the following meaning:

M - Answered male
F - Answered female
N - Answered non-binary
U - Declined to answer