What is AWK?
AWK is a programming language designed for text processing. At its core, it scans input line by line, checks each line against a set of patterns, and runs an action for every line that matches.
Basic Syntax of AWK
The essential syntax of an awk command is:
awk [options] 'program' input-file(s)
Here, the 'program' consists of a series of 'patterns' and corresponding 'actions'. Each action is enclosed in {} and is performed when its pattern matches.
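As a minimal sketch of the pattern and action structure, the snippet below feeds awk three made-up lines (the names and numbers are purely illustrative). The pattern is $2 > 26 and the action is {print $1}, so only the first field of lines whose second field exceeds 26 is printed.

```shell
# Hypothetical sample input: a name and an age on each line.
result=$(printf 'alice 30\nbob 25\ncarol 41\n' |
  awk '$2 > 26 {print $1}')   # pattern: $2 > 26, action: {print $1}
printf '%s\n' "$result"
```

The result is stored in a variable only so it can be inspected afterwards; on the command line you would normally run the awk command directly.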
Getting Started with AWK
Printing Lines and Fields
To print all lines from a file, use:
awk '{print}' filename
To print specific fields (e.g., first and second), the command is:
awk '{print $1, $2}' filename
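One detail worth trying for yourself is the comma in the print statement. In this small sketch (the sample line is made up), the comma makes awk put a space between the fields, while leaving it out simply concatenates them.

```shell
# Hypothetical sample line with three whitespace-separated fields.
line='alice 30 london'
with_comma=$(printf '%s\n' "$line" | awk '{print $1, $2}')   # fields joined by a space
no_comma=$(printf '%s\n' "$line" | awk '{print $1 $2}')      # fields concatenated
printf '%s\n%s\n' "$with_comma" "$no_comma"
```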
Handling Field Separators
By default, AWK treats any run of spaces and tabs as a single field separator. To use only a tab as the separator, pass it with -F:
awk -F '\t' '{print $1, $2}' filename
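To see why this matters, here is a small sketch with a made-up tab-separated record whose first field contains a space. With -F '\t' the space stays inside the field; with the default separator, "New" and "York" would be split into two fields.

```shell
# Hypothetical tab-separated record; the first field contains a space.
result=$(printf 'New York\t8400000\tUSA\n' | awk -F '\t' '{print $1, $2}')
printf '%s\n' "$result"
```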
Pattern Matching
AWK excels in pattern matching:
awk '/pattern/ {print $0}' filename
For case-insensitive matching, convert the line to lowercase with tolower and match it against a lowercase pattern:
awk 'tolower($0) ~ /pattern/ {print $0}' filename
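A quick way to check the behaviour is to pipe a few test lines through the command. In this sketch the input lines are invented and the pattern is error; note that the matched lines are printed with their original capitalization, because tolower only affects the comparison.

```shell
# Hypothetical log-like lines with mixed capitalization.
result=$(printf 'Error: disk full\nall good\nERROR: timeout\n' |
  awk 'tolower($0) ~ /error/ {print $0}')
printf '%s\n' "$result"
```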
Let's parse the output of ls -ltr with awk and print just the filenames.
The challenge arises with filenames that contain spaces: a simple command such as awk '{print $9}' prints only the first word of such a name. Here's a more careful approach:
ls -ltr | awk '{for (i=9; i<=NF; i++) printf "%s%s", $i, (i==NF ? "\n" : " ")}'
Built-in Variables
- $0: The entire current line.
- $1, $2, ...: The first, second, etc., fields of the current line.
- NR: Number of records (typically lines) processed so far.
- NF: Number of fields in the current record.
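The following sketch prints NR and NF for two made-up input lines, which is a handy way to check how awk is counting records and fields.

```shell
# Hypothetical input: one line with three fields, one with two.
result=$(printf 'a b c\nd e\n' | awk '{print NR ": " NF " fields"}')
printf '%s\n' "$result"
```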
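The ls -ltr | awk command above works as follows:
- It loops from the 9th field (which is where filenames start in the ls -ltr output) to the last field (NF).
- It prints each of these fields, adding a space between them unless it's the last field, in which case it adds a newline.
This reconstructs filenames that contain spaces. A run of several consecutive spaces inside a name is collapsed to a single space, though, because awk discards the original whitespace when it splits the line into fields.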
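To try the ls -ltr loop without depending on a real directory, the sketch below feeds awk a few made-up lines shaped like ls -ltr output. The listing is hypothetical, and it assumes the filename starts at field 9, which can differ between systems and locales.

```shell
# Hypothetical lines imitating `ls -ltr` output; real output varies by system.
listing='total 4
-rw-r--r-- 1 user group 0 Jan  1 12:00 notes.txt
-rw-r--r-- 1 user group 0 Jan  1 12:05 my file.txt'
result=$(printf '%s\n' "$listing" |
  awk '{for (i=9; i<=NF; i++) printf "%s%s", $i, (i==NF ? "\n" : " ")}')
printf '%s\n' "$result"
```

The "total" line has only two fields, so the loop never runs for it and nothing is printed.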
More Complex Operations
Summing a Column:
awk '{sum += $1} END {print sum}' filename
This sums up the values in the first field of each line.
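Here is the same idea on a few made-up numbers, so you can see the running total without needing a file.

```shell
# Hypothetical column of numbers, one per line.
result=$(printf '10\n20\n12.5\n' | awk '{sum += $1} END {print sum}')
printf '%s\n' "$result"
```

The END block runs once after the last line has been read, which is why the total is printed only a single time.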
Text Processing:
awk '/pattern/ {gsub(/old/, "new"); print}' filename
This finds lines matching pattern, replaces every occurrence of 'old' with 'new' in them, and prints the result.
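A small sketch with invented input makes the two steps visible. Here the pattern is cat: lines without "cat" are skipped entirely, and on the lines that remain every "old" becomes "new".

```shell
# Hypothetical input: only lines containing "cat" are selected.
result=$(printf 'old cat, old hat\nold dog\nnew cat\n' |
  awk '/cat/ {gsub(/old/, "new"); print}')
printf '%s\n' "$result"
```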
Options in AWK
The options in awk are command-line flags that modify its behavior. Some common options include:
- -F: Specifies a field separator.
  - Example: -F ':' uses a colon as the field separator.
- -v: Allows you to set a variable.
  - Example: -v OFS=',' sets the output field separator to a comma.
- -f: Specifies that the awk program is to be read from a file.
  - Example: awk -f program.awk inputfile reads the awk program from program.awk.
These options can be used to alter how awk processes input files and handles data.
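The options can be combined. The sketch below uses made-up colon-separated input, first with -F and -v together, then with -f reading the program from a temporary file. The temporary file (created with mktemp, assumed to be available on your system) is only there to keep the example self-contained.

```shell
# Hypothetical colon-separated input, similar in shape to /etc/passwd.
result=$(printf 'root:x:0:0\ndaemon:x:1:1\n' |
  awk -F ':' -v OFS=',' '{print $1, $3}')
printf '%s\n' "$result"

# Same idea with -f: the program is stored in a file instead of on the command line.
prog=$(mktemp)
printf '%s\n' '{print $1, $3}' > "$prog"
from_file=$(printf 'root:x:0:0\n' | awk -F ':' -f "$prog")
rm -f "$prog"
printf '%s\n' "$from_file"
```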
BEGIN
The BEGIN keyword in awk is used to specify actions that should be executed before any input lines are read. It's a special kind of pattern which does not process input text but sets up initial conditions or configurations for the awk program. The BEGIN block is executed only once, at the very start of the program, making it an ideal place for initialization tasks.
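As a minimal sketch, a BEGIN block can print a header line before any input is processed. The input lines and the header text here are made up for illustration.

```shell
# Hypothetical input; the BEGIN block runs once, before the first line is read.
result=$(printf 'alice 30\nbob 25\n' |
  awk 'BEGIN {print "name age"} {print $1, $2}')
printf '%s\n' "$result"
```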
We've covered the basic usage of awk. Four of its built-in variables give you finer control over how input is split and how output is formatted: FS, RS, OFS, and ORS.
Field Separator (FS)
What it does: FS stands for Field Separator. It's the character (or regular expression) that awk uses to divide an input line into fields.
Default Value: By default, FS is a single space, which awk treats specially: fields are separated by any run of spaces or tabs, and leading and trailing whitespace is ignored.
Customization: You can change FS to a different character, like a comma or colon, to match the structure of your input data. This is often done in the BEGIN block of an awk program.
awk 'BEGIN {FS=":"} {print $1}' filename
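One practical difference between the default separator and a custom one is how repeated separators are counted. In this sketch (input made up), three spaces in a row count as one separator under the default, while two commas in a row produce an empty field when FS is a comma.

```shell
# Default FS: a run of spaces is a single separator, so there are 2 fields.
default_nf=$(printf 'a   b\n' | awk '{print NF}')
# FS set to a comma: each comma separates, so the empty middle field is counted.
comma_nf=$(printf 'a,,b\n' | awk 'BEGIN {FS=","} {print NF}')
printf '%s %s\n' "$default_nf" "$comma_nf"
```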
Record Separator (RS)
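What it does: RS is the Record Separator. It defines where one input record ends and the next begins.
Default Value: The default RS is a newline character, meaning each line in the input is treated as a separate record.
Usage: Changing RS can be useful for processing data where records are separated by a different character, such as a semicolon. Some implementations, such as gawk, also accept a regular expression here, but only a single character is portable.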
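Here is a small sketch of a custom record separator, using made-up input where records are separated by semicolons instead of newlines. Each piece becomes its own record, so NR counts three records even though the input is a single line.

```shell
# Hypothetical input: three records separated by ";" (no trailing newline).
result=$(printf 'a;b;c' | awk 'BEGIN {RS=";"} {print NR ": " $0}')
printf '%s\n' "$result"
```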
Output Field Separator (OFS)
What it does: OFS stands for Output Field Separator. It's used by awk to separate fields when printing output.
Default Value: The default OFS is a space.
Functionality: If your awk print statement contains multiple fields separated by commas, awk will insert the OFS value between them.
awk 'BEGIN {OFS=","} {print $1, $2}' filename
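A common surprise is that OFS has no effect on print $0 unless the record has been rebuilt. The sketch below (input made up) shows both cases; assigning a field to itself with $1 = $1 is a widely used idiom to force awk to rebuild $0 using OFS.

```shell
# Printing $0 untouched: OFS is not applied.
unchanged=$(printf 'a b c\n' | awk 'BEGIN {OFS="-"} {print $0}')
# Assigning to a field rebuilds $0 with OFS between the fields.
rebuilt=$(printf 'a b c\n' | awk 'BEGIN {OFS="-"} {$1 = $1; print $0}')
printf '%s\n%s\n' "$unchanged" "$rebuilt"
```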
Output Record Separator (ORS)
What it does: ORS is the Output Record Separator, dictating how awk separates output records.
Default Value: The default for ORS is a newline character, meaning each print statement ends its output with a newline.
Implications: By altering ORS, you can change what is written after each print statement, for example to join all output records on a single line.
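As a sketch, setting ORS to a comma joins three made-up input lines into one line of output. Note that ORS is written after every record, including the last, so the result ends with a trailing comma and no newline.

```shell
# Hypothetical input: three lines joined with "," as the output record separator.
result=$(printf 'a\nb\nc\n' | awk 'BEGIN {ORS=","} {print $1}')
printf '%s\n' "$result"
```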
AWK is a powerful tool that can greatly simplify text processing tasks in Linux, whether it's extracting columns from a file or parsing complex command outputs.