Introduction to AWK

Overview

AWK is a gem of a language. It is small and simple enough to learn in about an hour, and many useful programs fit on a single line of code. AWK is included in most Linux distributions.

My next blog post is going to be about ensuring data quality with AWK, so I thought I would write this post as an introduction to the language.

The data

Most of the examples use a “fictitious” dataset: a time series of daily rainfall and temperature readings. The following is an excerpt:

1910,1,6,0.0,37.0,22.5
1910,1,7,4.1,36.3,21.6
1910,1,8,18.8,34.4,17.8
1910,1,9,0.0,-99.9,-99.9
1910,1,10,0.0,-99.9,-99.9
1910,1,11,11.4,17.1,16.2
1910,1,12,21.1,20.7,15.4
1910,1,13,0.3,22.9,16.2
1910,1,14,7.9,23.2,16.2
1910,1,15,15.5,28.1,17.5
1910,1,16,0.0,28.9,-99.9
1910,1,17,0.0,23.7,-99.9

The first column is the year followed by the month and day. The last three columns are rainfall (in mm), maximum temperature and minimum temperature (in degrees Celsius). Missing values are represented by the value -99.9.

You can download the full data set by visiting this link:

Download data

Structure of an AWK program

An AWK program consists of a list of patterns and actions. AWK will go through the input line by line and try to match each pattern against the line (I will soon explain what is meant by a pattern). If the pattern matches, it will execute the action.

So, a template for an AWK program looks like this:

pattern1 {action1}
pattern2 {action2}
pattern3 {action3}
...

The other thing to know about AWK is that it splits each line (record) it reads into fields. By default, fields are separated by runs of whitespace (spaces or tabs), but we can specify a different separator. The first field is referred to as $1, the second field as $2 and so on. The whole record is referred to as $0.
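To see field splitting in action, here is a small sketch that feeds single records to awk from the command line. The first record is taken from the sample data above; the second is made-up text used only to show the default whitespace splitting. The -F, option, which sets the separator to a comma, is explained in the next section.

```shell
# One record of the sample data, split on commas.
# Prints the first field, the fourth field, then the whole record.
echo '1910,1,6,0.0,37.0,22.5' | awk -F, '{print $1; print $4; print $0}'
# 1910
# 0.0
# 1910,1,6,0.0,37.0,22.5

# Without -F, runs of spaces or tabs separate the fields.
echo 'alpha   beta gamma' | awk '{print $2}'
# beta
```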

Our first AWK program

We will use the file acorn01.csv as our input. In our program we will print out all records for the year 1911. In other words, we want to print all records where $1 equals 1911. The pattern we will use is $1 == 1911 and the action will be print. Since this is such a simple program, we can write it all on the command line as follows:

$ awk -F, '$1 == 1911 {print}' acorn01.csv

We read this as follows:

awk - Invoke the program awk.

-F, - Use a comma (,) as the field separator.

$1 == 1911 - This is our pattern. When it matches, execute the action.

{print} - Print the line. We could also have used print $0.

acorn01.csv - This is our input file.

When we run the program we get the following output:

1911,1,1,0.0,28.9,-99.9
1911,1,2,0.0,32.3,-99.9
1911,1,3,0.0,36.6,8.8
...
1911,12,29,0.0,-99.9,9.1
1911,12,30,0.0,36.0,20.5
1911,12,31,0.0,28.6,20.1

Of course, I have left out a lot of the output. We can, however, use another Unix utility, wc, to count the lines of output. wc counts lines, words and bytes; with wc -l it reports only the number of lines. We pipe the output of awk into wc:

$ awk -F, '$1 == 1911 {print}' acorn01.csv | wc -l

This prints the number of records for 1911, which is what we expect for a full year of daily data.
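The same pipeline can be tried on the twelve sample records shown earlier. This is a minimal sketch that assumes they are saved in a file called sample.csv, a hypothetical name used only for this illustration.

```shell
# Save the twelve sample records in sample.csv (hypothetical file name).
cat > sample.csv <<'EOF'
1910,1,6,0.0,37.0,22.5
1910,1,7,4.1,36.3,21.6
1910,1,8,18.8,34.4,17.8
1910,1,9,0.0,-99.9,-99.9
1910,1,10,0.0,-99.9,-99.9
1910,1,11,11.4,17.1,16.2
1910,1,12,21.1,20.7,15.4
1910,1,13,0.3,22.9,16.2
1910,1,14,7.9,23.2,16.2
1910,1,15,15.5,28.1,17.5
1910,1,16,0.0,28.9,-99.9
1910,1,17,0.0,23.7,-99.9
EOF

# Count the records for 1910. All twelve rows match, so wc -l reports 12
# (some systems pad the number with leading spaces).
awk -F, '$1 == 1910 {print}' sample.csv | wc -l
```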

Suppose we only want to see the values for January 1911. We can make our pattern more specific:

$ awk -F, '$1 == 1911 && $2 == 1 {print}' acorn01.csv

The && operator means "and": the pattern matches only when the first field equals 1911 and the second field equals 1 (January).

When running the above command, we see:

1911,1,1,0.0,28.9,-99.9
1911,1,2,0.0,32.3,-99.9
1911,1,3,0.0,36.6,8.8
...
1911,1,29,0.0,27.9,14.7
1911,1,30,0.0,29.4,17.1
1911,1,31,2.8,31.3,16.2

If we don’t specify an action, AWK prints every line that matches the pattern. This means that we can rewrite the above program as:

$ awk -F, '$1 == 1911 && $2 == 1' acorn01.csv
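As a quick check that the two forms are equivalent, the sketch below runs both against a few of the sample records. The file name sample.csv and the condition on the day field are assumptions chosen for this illustration.

```shell
# A few of the sample records, saved under a hypothetical file name.
cat > sample.csv <<'EOF'
1910,1,13,0.3,22.9,16.2
1910,1,14,7.9,23.2,16.2
1910,1,15,15.5,28.1,17.5
1910,1,16,0.0,28.9,-99.9
1910,1,17,0.0,23.7,-99.9
EOF

# Explicit action: print records for 1910 where the day is after the 14th.
awk -F, '$1 == 1910 && $3 > 14 {print}' sample.csv

# No action: the matching lines are printed by default.
awk -F, '$1 == 1910 && $3 > 14' sample.csv

# Each command prints the same three lines:
# 1910,1,15,15.5,28.1,17.5
# 1910,1,16,0.0,28.9,-99.9
# 1910,1,17,0.0,23.7,-99.9
```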

Now suppose we want to print the rainfall in 1911, but are only interested in days when the rainfall was above 20 mm. We could do it as follows:

$ awk -F, '$1 == 1911 && $4 > 20' acorn01.csv

We can improve the output. First, we will print only the date and the rainfall:

$ awk -F, '$1 == 1911 && $4 > 20 {print $1, $2, $3, $4}' acorn01.csv

This gives us:

1911 1 13 26.7
1911 1 18 22.9
1911 2 14 36.1
1911 3 8 20.1
1911 3 14 21.8
1911 5 18 20.1
1911 6 19 29.0
1911 11 28 45.2

We can improve it further as follows:

$ awk -F, '$1 == 1911 && $4 > 20 {print $3 "/" $2 "/" $1, $4}' \
acorn01.csv

The backslash (\) at the end of the first line tells the shell that the command continues on the next line.

This gives the following output:

13/1/1911 26.7
18/1/1911 22.9
14/2/1911 36.1
8/3/1911 20.1
14/3/1911 21.8
18/5/1911 20.1
19/6/1911 29.0
28/11/1911 45.2
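The print statement above mixes two ways of combining values. Placing values next to each other, as in $3 "/" $2, concatenates them with nothing in between, while a comma makes print insert a space (the default output separator). A small sketch on one record from the sample data shows the difference:

```shell
# Adjacent values are concatenated; a comma inserts a space in the output.
echo '1910,1,8,18.8,34.4,17.8' | awk -F, '{print $3 "/" $2 "/" $1, $4}'
# 8/1/1910 18.8

# Concatenation only: the three fields run together.
echo '1910,1,8,18.8,34.4,17.8' | awk -F, '{print $3 $2 $1}'
# 811910

# Commas only: the three fields are separated by spaces.
echo '1910,1,8,18.8,34.4,17.8' | awk -F, '{print $3, $2, $1}'
# 8 1 1910
```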

It is possible to improve the output further so that it looks like this:

1911-01-13 26.7
1911-01-18 22.9
1911-02-14 36.1
1911-03-08 20.1
1911-03-14 21.8
1911-05-18 20.1
1911-06-19 29.0
1911-11-28 45.2

but that is slightly beyond the scope of this tutorial.
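For readers who want to try it anyway, one possible approach is AWK's printf statement, which can pad numbers with leading zeros. The sketch below is only an illustration: it runs on a few records from the sample data, saved under the hypothetical name sample.csv, and uses a threshold of 15 mm so that some of them match.

```shell
# A few of the sample records, saved under a hypothetical file name.
cat > sample.csv <<'EOF'
1910,1,7,4.1,36.3,21.6
1910,1,8,18.8,34.4,17.8
1910,1,11,11.4,17.1,16.2
1910,1,12,21.1,20.7,15.4
1910,1,15,15.5,28.1,17.5
EOF

# %02d pads month and day to two digits; %s prints the rainfall unchanged.
# Unlike print, printf needs an explicit newline (\n).
awk -F, '$4 > 15 {printf "%d-%02d-%02d %s\n", $1, $2, $3, $4}' sample.csv
# 1910-01-08 18.8
# 1910-01-12 21.1
# 1910-01-15 15.5
```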

BEGIN and END

There are a number of special patterns. Two of these are BEGIN and END. BEGIN matches before any line of input has been read. END matches after the last line of input has been read. Suppose we want to print a header and a total. We can do this using BEGIN and END as follows:

BEGIN {print "Date", "Rainfall"}
$1 == 1911 && $4 > 20 {print $3 "/" $2 "/" $1,  $4; total += $4}
END   {print "Total: ", total}

Save the above in a file called rain.awk. We can now run it with the following command:

$ awk -F, -f rain.awk acorn01.csv

And we will see the following:

Date Rainfall
13/1/1911 26.7
18/1/1911 22.9
14/2/1911 36.1
8/3/1911 20.1
14/3/1911 21.8
18/5/1911 20.1
19/6/1911 29.0
28/11/1911 45.2
Total:  221.9
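BEGIN and END are also handy for setup and summaries. The sketch below is an illustration, not part of the rain.awk program: it sets the field separator inside BEGIN by assigning to the built-in variable FS (an alternative to -F,), and uses END to print the mean maximum temperature while skipping missing values. It runs on the sample records from the start of this post, saved under the hypothetical name sample.csv.

```shell
# The sample records, saved under a hypothetical file name.
cat > sample.csv <<'EOF'
1910,1,6,0.0,37.0,22.5
1910,1,7,4.1,36.3,21.6
1910,1,8,18.8,34.4,17.8
1910,1,9,0.0,-99.9,-99.9
1910,1,10,0.0,-99.9,-99.9
1910,1,11,11.4,17.1,16.2
1910,1,12,21.1,20.7,15.4
1910,1,13,0.3,22.9,16.2
1910,1,14,7.9,23.2,16.2
1910,1,15,15.5,28.1,17.5
1910,1,16,0.0,28.9,-99.9
1910,1,17,0.0,23.7,-99.9
EOF

awk '
BEGIN { FS = "," }                # same effect as -F, on the command line
$5 != -99.9 { sum += $5; n++ }    # accumulate only valid maximum temperatures
END { if (n > 0) printf "Mean max temp: %.2f over %d days\n", sum / n, n }
' sample.csv
# Mean max temp: 27.23 over 10 days
```

Variables such as sum and n start out empty and behave as zero in arithmetic, which is why the program needs no explicit initialisation.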