Awk the Unchained


awk is a powerful text-processing language well suited to working with files (like CSVs and logs), processing data streams (like the output of other commands), and transforming data in a variety of ways.

I see awk as a tool that makes a real difference and takes you to the next level, freeing you from the standard ways of processing streams of data and files.

Syntax «

awk 'pattern { action }' <path_to_file>

pattern: This specifies when the action should be executed. It can be a regular expression, a condition (like matching a specific value), or simply true (to apply the action to every line). If no pattern is provided, the action is applied to every line of input.

action: This defines what awk should do when the pattern matches. The action is enclosed in curly braces {}. If no action is provided, awk will simply print the matched lines.

<path_to_file>: This is the path to the file to be processed. If the path is not provided, awk will process input from the standard input.
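As a sketch of the three forms (using hypothetical sample data written to /tmp/demo.txt):

```shell
# Hypothetical sample data
printf 'alpha 1\nbeta 2\nalpha 3\n' > /tmp/demo.txt

# Pattern only: the default action prints matching lines
awk '/alpha/' /tmp/demo.txt

# Action only: runs on every line
awk '{ print $2 }' /tmp/demo.txt

# Pattern and action: print the second field of matching lines
awk '/alpha/ { print $2 }' /tmp/demo.txt
```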

Basic use «

To print the IDs of the running containers, we can print only the first column of the docker ps output.

docker ps | awk '{print $1}'

$1 represents the first column in each line of the docker ps output. In this case, it’s the container ID.
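Note that docker ps also prints a header line, so the first value emitted is the literal word CONTAINER. One way to skip it (a sketch relying on awk's built-in NR line counter):

```shell
# NR is the current record (line) number; NR > 1 skips the header line
docker ps | awk 'NR > 1 { print $1 }'
```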

Using a pattern to match lines «

If I want to get the ID of the Postgres container:

docker ps | awk '/postgres/ {print $1}'

Perform an action on a line «

If I want to get the total amount of space occupied by the files in the current directory:

ls -l | awk '{ sum += $5 } END {print sum}'

END: a block that is executed after all the lines have been processed.

There is also the BEGIN block, which is executed before any line is processed.

Here is an example of their usage:

awk 'BEGIN { print "Start processing" }
     { print $1 }
     END { print "Finished processing" }' <path_to_file>
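As a concrete sketch (with made-up numbers), BEGIN and END pair naturally for computing aggregates such as an average:

```shell
# Accumulate in the per-line block, report once in END
printf '10\n20\n30\n' | awk '
    BEGIN { print "Start processing" }
          { sum += $1; count++ }
    END   { print "Average:", sum / count }'
```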

Advanced use «

Filter out rows from a CSV file «

Given a CSV file like this:

id,product_id,business_id
1,23,45
9,45,89
39,356,45
...

We want to get the header and all the lines with a specific business_id (here, 45):

awk -F',' 'NR==1 || $3==45 { print $0 }' <path_to_file>
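Instead of hard-coding the value, the business ID can be passed in from the shell with awk's -v option. A sketch, with the sample file written to a hypothetical /tmp/data.csv:

```shell
# Recreate the sample CSV (hypothetical path)
printf 'id,product_id,business_id\n1,23,45\n9,45,89\n39,356,45\n' > /tmp/data.csv

# -v assigns a shell value to an awk variable before any line is read;
# NR==1 keeps the header row
business_id=45
awk -F',' -v id="$business_id" 'NR==1 || $3==id' /tmp/data.csv
```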

Remove duplicated lines without changing the order of the lines «

With awk, it is possible to remove all duplicate lines, regardless of their position in the file, unlike the uniq command, which only removes adjacent duplicate lines.

awk '!seen[$0]++' <path_to_file>

By using seen[$0], we track whether each line has appeared before. If a line has already been seen, it’s skipped; otherwise, it’s printed.

We are not limited to using only $0. With

awk '!seen[$1$2]++' <path_to_file>

$1$2 concatenates the values of the first and second fields. This will ensure that duplicates are only removed based on the combination of the first two fields, while ignoring the rest of the line.
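One caveat: plain concatenation can make distinct field pairs collide (for example, fields ab/c and a/bc both produce the key abc). Using a comma in the array subscript makes awk join the fields with its SUBSEP separator, keeping such keys distinct:

```shell
# A comma subscript joins the fields with SUBSEP, so "ab c" and "a bc"
# produce different keys and both lines survive deduplication
printf 'ab c x\na bc y\n' | awk '!seen[$1,$2]++'
```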

Explanation:

seen[$0]++ uses the post-increment operator: it returns the current count for the line (zero, or the empty string, the first time the line appears) and then increments it. The ! negates that value, so the whole expression evaluates to:

true if the count is zero/the empty string (first occurrence)

false if the count is greater than zero (a duplicate)

If the pattern evaluates to true, the associated action is executed. Since we don’t provide an action, awk, by default, prints the line.
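To see the post-increment at work, we can print the counter alongside each line (a throwaway sketch with made-up input):

```shell
# seen[$0]++ yields the count *before* incrementing: 0 on first sight
printf 'a\nb\na\n' | awk '{ print $0, "seen before:", seen[$0]++ }'
```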