Awk the Unchained
awk is a powerful text-processing language, great for working with files (like CSVs and logs), processing data streams (like the output of commands), and transforming data in a variety of ways.
I see awk as a tool that makes a real difference and takes you to the next level by freeing you from the standard ways of processing streams of data and files.
Syntax «
awk 'pattern { action }' <path_to_file>
pattern: This specifies when the action should be executed. It can be a regular expression, a condition (like matching a specific value), or simply true (to apply the action to every line). If no pattern is provided, the action is applied to every line of input.
action: This defines what awk should do when the pattern matches. The action is enclosed in curly braces {}. If no action is provided, awk will simply print the matched lines.
<path_to_file>: This is the path to the file to be processed. If the path is not provided, awk will process input from the standard input.
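For example, here are two minimal sketches (assuming a hypothetical log file called app.log): the first uses only a pattern, the second only an action.
awk '/error/' app.log
awk '{print $2}' app.log
The first command prints every line containing "error"; the second prints the second column of every line.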
Basic use «
Print specific columns «
To print the IDs of the running containers, we can print only the first column of the docker ps output.
docker ps | awk '{print $1}'
$1 represents the first column in each line of the docker ps output. In this case, it’s the container ID.
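We can also print more than one column, or use NF (the number of fields) to reach the last one. A quick sketch, assuming the default docker ps column layout:
docker ps | awk '{print $1, $2}'
docker ps | awk '{print $NF}'
Here $1 and $2 are the container ID and the image, while $NF is the last column (the container name).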
Using a pattern to match lines «
If I want to get the ID of the Postgres container
docker ps | awk '/postgres/ {print $1}'
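Since the pattern is a regular expression, we can also match several names at once. A sketch, assuming both a Postgres and a Redis container are running:
docker ps | awk '/postgres|redis/ {print $1}'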
Perform an action on a line «
If I want to get the total amount of space occupied by the files in the current directory
ls -l | awk '{ sum += $5 } END {print sum}'
END: this block is executed after all the lines have been processed.
There is also the BEGIN block, which is executed before any line is processed.
Here is an example of their usage:
awk 'BEGIN { print "Start processing" }
{ print $1 }
END { print "Finished processing" }' <path_to_file>
Advanced use «
Filter out rows from a CSV file «
Given a CSV file like this
id,product_id,business_id
1,23,45
9,45,89
39,356,45
...
We want to get the header and all the lines with a specific business_id (45 in this case).
awk -F',' 'NR==1 || $3==45 { print $0 }' <path_to_file>
- -F: specifies the column separator. By default, fields are split on whitespace, but our CSV uses a comma ‘,’
- NR: holds the number of the line (record) awk is currently processing
- ||: the boolean OR operator
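If the business_id changes often, it can be passed in as a variable instead of being hard-coded. A sketch using the -v option:
awk -F',' -v id=45 'NR==1 || $3==id { print $0 }' <path_to_file>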
Remove duplicate lines without changing their order «
With awk, it is possible to remove all duplicate lines, regardless of their position in the file, unlike the uniq command, which only removes adjacent duplicate lines.
awk '!seen[$0]++' <path_to_file>
By using seen[$0], we track whether each line has appeared before. If a line has already been seen, it’s skipped; otherwise, it’s printed.
We are not limited to using only $0. With
awk '!seen[$1$2]++' <path_to_file>
the key is built by concatenating the values of the first and second fields ($1$2). This ensures that duplicates are removed based only on the combination of the first two fields, while the rest of the line is ignored.
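One caveat: plain concatenation can produce the same key for different field pairs (for example, ab + c and a + bc both become abc). A safer sketch indexes the array with a comma, which joins the fields with awk’s built-in SUBSEP separator:
awk '!seen[$1,$2]++' <path_to_file>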
Explanation:
- seen[] is an associative array (similar to a map or dictionary) that awk initializes the first time we access it.
- seen[$0] accesses the value stored in the map under the key $0.
- ! negates the value returned from the map:
- In awk, any nonzero numeric value or any nonempty string value is true.
- By default, variables are initialized to the empty string, which is zero if converted to a number.
- That being said:
- If seen[$0] is 0 or the empty string, the negation evaluates to true.
- If seen[$0] is greater than 0, the negation evaluates to false.
- The ++ is a post-increment operator. This means that the value of seen[$0] is accessed first to evaluate the expression, and only then is it incremented.
Summing up, the whole expression evaluates to:
- true if the occurrences are zero/empty string
- false if the occurrences are greater than zero
If the pattern succeeds, then the associated action is executed. If we don’t provide an action, awk, by default, prints the line.
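To make the idiom explicit, here is a sketch that does the same thing with an if statement instead of relying on the default action:
awk '{ if (!seen[$0]) print $0; seen[$0]++ }' <path_to_file>
The !seen[$0]++ form simply packs the check and the increment into a single pattern and lets awk’s default action do the printing.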