How to use the command 'awk' (with examples)
AWK is a versatile programming language that is primarily used for text processing and data extraction in files. Its powerful scripting capabilities allow users to perform complex manipulations on text data with relative ease. AWK scripts are widely employed for data reporting, filtering, and transforming structured data. Whether you are a data analyst or simply looking to automate processing tasks in text files, AWK can significantly streamline your workflow.
Use case 1: Print the fifth column (a.k.a. field) in a space-separated file
Code:
awk '{print $5}' path/to/file
Motivation:
This use case is helpful when you need to extract specific information from a structured dataset. For instance, if you have a space-separated values (SSV) file with multiple columns of data and you only want to look at the information in the fifth column, this code will do just that. It’s commonly used in scenarios where datasets have fixed formats and precise data needs to be isolated for further examination or processing.
Explanation:
awk: Invokes the AWK command-line tool.
'{print $5}': This program block tells AWK to print the fifth field of each line. In AWK, fields are delimited by whitespace by default.
path/to/file: Specifies the path to the file being processed.
Example Output:
Imagine you have a file containing the following space-separated data:
Name Age Gender Department Salary
John 34 Male Sales 55000
Alice 29 Female HR 70000
The command will output:
Salary
55000
70000
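To extract more than one column, list several fields in the print statement separated by commas; AWK joins them with the output field separator (a space by default). A small sketch, using the same hypothetical file:

```shell
# Print the first and fifth fields of each line, joined by a space
awk '{print $1, $5}' path/to/file
```

On the sample data above, each output line would contain the name followed by the salary.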
Use case 2: Print the second column of the lines containing “foo” in a space-separated file
Code:
awk '/foo/ {print $2}' path/to/file
Motivation:
This example is focused on data filtering based on a specific condition. It is useful when you need information from lines containing a certain keyword within a structured file. Scenarios could involve log files where you want to isolate events related to “foo” and display related data from those lines.
Explanation:
/foo/: This pattern tells AWK to match lines that contain the string “foo”.
'{print $2}': This action prints the second field of each matched line.
path/to/file: Indicates the location of the file to be operated on.
Example Output:
Given the following content in the file:
bar 123 xyz
foo 456 abc
baz 789
foo 101112 def
The output will be:
456
101112
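If the keyword may appear in any letter case (FOO, Foo, …), one common variant is to lower-case each line before matching, using AWK’s built-in tolower function:

```shell
# Match "foo" regardless of case by lower-casing the whole line first
awk 'tolower($0) ~ /foo/ {print $2}' path/to/file
```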
Use case 3: Print the last column of each line in a file, using a comma (instead of space) as a field separator
Code:
awk -F ',' '{print $NF}' path/to/file
Motivation:
This practical use case allows for the extraction of the last piece of data from each line in a comma-separated values (CSV) file. This functionality is often leveraged in data post-processing where you need to isolate trailing data points, such as timestamps or final status codes, across numerous entries for analysis or verification.
Explanation:
-F ',': Specifies the field separator as a comma rather than the default whitespace.
'{print $NF}': NF is a special AWK variable that holds the number of fields in the current record, so $NF returns the last field.
path/to/file: Points to the input file.
Example Output:
If the file contains:
jan,1000,a
feb,2000,b
mar,3000,c
The output will be:
a
b
c
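Because NF is an ordinary numeric variable, it can be used in arithmetic inside the field reference. For example, this sketch prints the second-to-last comma-separated field instead:

```shell
# Print the second-to-last field of each comma-separated line
awk -F ',' '{print $(NF-1)}' path/to/file
```

For the sample file above this would print 1000, 2000, and 3000.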
Use case 4: Sum the values in the first column of a file and print the total
Code:
awk '{s+=$1} END {print s}' path/to/file
Motivation:
Summing values is a common task when handling numerical data. This command line efficiently accumulates numbers present in the first column, which is typical in scenarios such as financial analysis (e.g., expenses), inventory counts, or survey results where totals need to be calculated.
Explanation:
'{s+=$1}': This part processes each line, adding the value in the first column to the sum s.
END {print s}: After processing all lines, AWK executes the END block and prints the final sum.
path/to/file: Specifies the file containing the data to be summed.
Example Output:
Given data in the file:
10
20
30
40
The output will be:
100
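The same pattern extends to other aggregates. For instance, dividing the sum by the record count NR in the END block yields the average of the first column (the NR > 0 guard avoids dividing by zero on empty input):

```shell
# Average the values in the first column
awk '{s+=$1} END {if (NR > 0) print s/NR}' path/to/file
```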
Use case 5: Print every third line starting from the first line
Code:
awk 'NR%3==1' path/to/file
Motivation:
This use case is tailored for data sampling. By printing every third line, it helps in scenarios where data reduction is necessary, or when you want to extract a representative sample for analysis without loading the entire dataset into memory.
Explanation:
NR%3==1: NR is a built-in AWK variable holding the current record number (i.e., line number). The expression selects every third line, starting with the first (lines 1, 4, 7, and so on).
path/to/file: Designates the target file from which lines are extracted.
Example Output:
For a file containing:
Line1
Line2
Line3
Line4
Line5
Line6
Line7
Line8
Line9
The output will be:
Line1
Line4
Line7
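The step size can be made a parameter instead of being hard-coded, by passing it in with -v. A minimal sketch, with the hypothetical variable name n:

```shell
# Print every n-th line starting from the first; n is set on the command line
awk -v n=3 'NR % n == 1' path/to/file
```

Changing n=3 to, say, n=5 samples every fifth line without editing the script itself.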
Use case 6: Print different values based on conditions
Code:
awk '{if ($1 == "foo") print "Exact match foo"; else if ($1 ~ "bar") print "Partial match bar"; else print "Baz"}' path/to/file
Motivation:
Conditional logic in AWK scripts facilitates decision-making processes based on the contents of each line or field. This scenario is ideal for applications where data needs to be categorized or labeled based on certain conditions, such as filtering logs or generating reports with customized summaries.
Explanation:
'if ($1 == "foo") print "Exact match foo";': Checks whether the first field is exactly “foo” and prints the corresponding message.
'else if ($1 ~ "bar") print "Partial match bar";': Checks whether the first field contains “bar” and prints another message.
'else print "Baz"': For all other cases, prints “Baz”.
path/to/file: Indicates the file to process.
Example Output:
If the content is:
foo 123
foobar 456
qux 789
The output will be:
Exact match foo
Partial match bar
Baz
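Note that $1 ~ "bar" is a substring (regex) match, which is why “foobar” matched above. To require the regex to match the whole field, anchor it with ^ and $, as in this sketch:

```shell
# /^bar$/ matches only when the entire first field is exactly "bar"
awk '$1 ~ /^bar$/ {print $2}' path/to/file
```

With this anchored pattern, a line starting with “foobar” would no longer match.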
Use case 7: Print all lines whose 10th column value is between a min and a max
Code:
awk '($10 >= min_value && $10 <= max_value)' path/to/file
Motivation:
Such filtering is useful for data validation or selection tasks. When working with datasets where you need to identify entries that match specific numerical criteria, this technique helps focus on relevant data points without altering the base structure of your file.
Explanation:
($10 >= min_value && $10 <= max_value): Selects lines whose 10th field falls within the defined range. Replace min_value and max_value with actual numbers, or pass them in as AWK variables with -v.
path/to/file: Specifies the file the command will filter.
Example Output:
Assuming:
... ... ... ... ... ... ... ... ... 5 ...
... ... ... ... ... ... ... ... ... 10 ...
... ... ... ... ... ... ... ... ... 15 ...
... ... ... ... ... ... ... ... ... 20 ...
With min_value = 10 and max_value = 15, the output is:
... ... ... ... ... ... ... ... ... 10 ...
... ... ... ... ... ... ... ... ... 15 ...
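Since shell variables are not visible inside the single-quoted AWK program, the usual way to supply the bounds is the -v option. A sketch with concrete values:

```shell
# Pass the bounds in as AWK variables instead of editing the script
awk -v min=10 -v max=15 '$10 >= min && $10 <= max' path/to/file
```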
Use case 8: Print a table of users with UID >= 1000, with a header and formatted output, using a colon as separator
Code:
awk 'BEGIN {FS=":";printf "%-20s %6s %25s\n", "Name", "UID", "Shell"} $3 >= 1000 {printf "%-20s %6d %25s\n", $1, $3, $7}' /etc/passwd
Motivation:
Formatted output significantly enhances readability for reports. In system administration, user details with specific UIDs are often queried, and presenting this information in a structured manner is invaluable for audits or documentation purposes.
Explanation:
BEGIN {FS=":"; ...}: Sets the field separator to a colon before any input is read, in preparation for processing the /etc/passwd file.
printf "%-20s %6s %25s\n", "Name", "UID", "Shell": Outputs the header with predefined widths and alignments.
$3 >= 1000 {printf "%-20s %6d %25s\n", $1, $3, $7}: Evaluates each line’s third field, which in /etc/passwd is the UID. If it is 1000 or greater, it prints the username, UID, and login shell in the same format.
/etc/passwd: The file containing user account information on Unix-like systems.
Example Output:
Name UID Shell
user1 1001 /bin/bash
user2 1005 /bin/zsh
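The separator can also be given on the command line with -F rather than in a BEGIN block. This sketch is equivalent for the data rows (the header is dropped for brevity; /etc/passwd stores the UID in field 3 and the shell in field 7):

```shell
# Same filter and formatting, with the separator set via -F and no header row
awk -F ':' '$3 >= 1000 {printf "%-20s %6d %25s\n", $1, $3, $7}' /etc/passwd
```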
Conclusion
AWK is an incredibly versatile tool for processing and analyzing text files, offering powerful capabilities for data manipulation and extraction. By understanding its syntax and applying it to scenarios like those above, users can effectively manage and transform datasets for a wide range of practical purposes. From extracting specific fields to conditional processing and formatted output, AWK remains an essential part of any data enthusiast’s or administrator’s toolkit.