-f file -- Read the awk script from the specified
file rather than the command line
-F re -- Use the given regular expression re as
the field separator rather than the default "white space"
variable=value -- Initialize the awk variable
with the specified value
An awk program consists of one or more awk commands separated
by either \n or semicolons.
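A minimal sketch of the three command line items above used together. The function name show_users, the variable sep and the sample data are invented for illustration; the "-" operand just tells awk to read stdin.

```shell
# Hypothetical example: -F, -f and variable=value on one command line.
show_users() {
    prog=$(mktemp)                          # temporary file holding the awk script
    echo '{ print $1 sep $3 }' > "$prog"
    printf 'root:x:0\ndaemon:x:1\n' |
        awk -F: -f "$prog" sep=' -> ' -     # ":" separates fields; sep is set before stdin is read
    rm -f "$prog"
}
show_users
```

Each input line is split on ":" and the first and third fields are printed with the value of sep between them.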
The structure of awk commands
Each awk command consists of a selector and/or an action; both
may not be omitted in the same command. Braces surround the
action.
selector [only] -- action is print
{action}[only] -- selector is every line
selector {action} -- perform action on each line where selector
is true
Each action may have multiple statements separated from each
other by semicolons or \n
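The three command forms can be sketched in one small program. The data and the name three_forms are made up for the illustration.

```shell
# Hypothetical example: selector only, action only, selector with action.
three_forms() {
    printf 'alpha 1\nbeta 2\ngamma 3\n' | awk '
        /beta/                        # selector only: matching lines are printed
        { n++ }                       # action only: runs for every line
        $2 > 2 { print "big:", $1 }   # selector and action
        END { print n, "lines" }'
}
three_forms
```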
Line selection
A selector consists of zero, one, or two selection criteria; when
there are two, the criteria are separated by a comma
A selection criterion may be either an RE or a boolean
expression (BE) which evaluates to true or false
Commands which have no selection criteria are applied to each
line of the input data set
Commands which have one selection criterion are applied to
every line which matches or makes true the criterion depending upon
whether the criterion is an RE or a BE
Commands which have two selection criteria are applied to the
first line which matches the first criterion, the next line which
matches the second criterion and all the lines between them.
Unless a prior applied command has a next in it, every selector
is tested against every line of the input data set.
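As a sketch of a two-criterion selector and of the effect of next, the following uses invented marker lines START and END in the data (here END is just text inside an RE, not the END block).

```shell
# Hypothetical example: a range selector, with next hiding the lines from later commands.
range_demo() {
    printf 'a\nSTART\nb\nEND\nc\n' | awk '
        /START/,/END/ { print "in range:", $0; next }
        { print "outside:", $0 }'
}
range_demo
```

Without the next, the second command would also run for the three lines inside the range.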
Processing
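The BEGIN block(s) is(are) run (variables set with -v, in mawk and
other POSIX awks, are assigned before this)
Command line variables (variable=value) are assigned, each just
before the input file which follows it is read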
For each line in the input data set
It is read and NR, NF, $0, $1, etc. are set
For each command, its criteria are evaluated
If the criteria are true or match, the command is executed
After the input data set is exhausted, the END block(s) is(are)
run
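The order of events can be observed with a short experiment. This is only a sketch: the variable names early and late are invented, and it assumes an awk that supports -v.

```shell
# Hypothetical example: when -v and variable=value assignments become visible.
order_demo() {
    printf 'one\ntwo\n' | awk -v early=set '
        BEGIN { print "BEGIN: early=" early ", late=" late }
        { print NR ": late=" late ", NF=" NF }
        END { print "END: NR=" NR }' late=set -
}
order_demo
```

The BEGIN block sees early but not late; by the time the first line is read, late has been assigned.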
Elementary awk programming
Constants
Strings are enclosed in quotes (")
Numbers are written in the usual decimal way; non-integer
values are indicated by including a period (.) in the
representation.
REs are delimited by /
Variables
Need not be declared
May contain any type of data; the data type may change over
the life of the program
Are named as any token beginning with a letter and continuing
with letters, digits and underscores
As in C, case matters; since all the built-in variables are all
uppercase, avoid this form.
Some of the commonly used built-in variables are:
NR -- The current line's sequential number
NF -- The number of fields in the current line
FS -- The input field separator; defaults to whitespace
and is reset by the -F command line parameter
Fields
Each record is separated into fields named $1,
$2, etc
$0 is the entire record
NF contains the number of fields in the current
line
FS contains the field separator RE; it defaults to white space,
which behaves like the RE /[<SPACE><TAB>]+/ with leading and
trailing white space ignored
Fields may be accessed either by $n or by $var
where var contains a value between 0 and
NF
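A small sketch of field access by number, by variable and by expression; the sample sentence and the variable n are arbitrary.

```shell
# Hypothetical example: $n, $var and $(expression).
fields_demo() {
    echo 'the quick brown fox' | awk '{
        n = 2
        print NF          # number of fields
        print $n          # field chosen by a variable
        print $NF         # last field
        print $(NF-1)     # next to last field
    }'
}
fields_demo
```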
print/printf
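print prints the current record ($0, that is $1 through
$NF) then prints a \n onto stdout; if a field has been changed the
fields are rejoined with OFS, whose default value is a blank
print value, value ... prints the value(s) in
order, separated by OFS, and then puts out a \n onto stdout;
values separated only by blanks are concatenated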
printf(format,value,value,...) prints the
value(s) using the format supplied onto stdout, just
like C. There is no default \n for each printf so
multiples can be used to build a line. There must be as many
values in the list as there are item descriptors in format.
Values in print or printf may be
constants, variables, or expressions in any order
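The difference between the forms of print, and the way printf builds a line in pieces, can be sketched as follows (the input is arbitrary).

```shell
# Hypothetical example: print with and without commas, and printf without a newline.
print_demo() {
    echo 'a b c' | awk '{
        print                   # the whole record
        print $1, $3            # comma: values joined by OFS
        print $1 $3             # blank: values concatenated
        printf("%s-", $1)       # no newline yet
        printf("%s\n", $2)      # completes the line
    }'
}
print_demo
```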
Operators - awk has many of the same operators as C, excepting the bit operators. It
also adds some text processing operators.
Built-in functions
substr(s,p,l) -- The substring of s starting at
p and continuing for l characters
index(s1,s2) -- The first location of
s2 within s1; 0 if not found
length(e) -- The length of e,
converted to character string if necessary, in bytes
sin, cos -- Standard C trig functions (there is no
tan; use sin(x)/cos(x))
atan2(y,x) -- Standard quadrant oriented
arctangent function of y/x
exp, log -- Standard C exponential
functions
srand(s), rand() -- Random number seed and
access functions
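A brief sketch exercising some of these functions on a made-up string; the program has only a BEGIN block, so it reads no input.

```shell
# Hypothetical example: substr, index, length and atan2.
builtin_demo() {
    awk 'BEGIN {
        s = "hello world"
        print substr(s, 7, 5)               # 5 characters starting at position 7
        print index(s, "o")                 # first "o"
        print index(s, "z")                 # not found
        print length(s)
        printf("%.4f\n", atan2(1, 1) * 4)   # pi, since atan2(1,1) is pi/4
    }'
}
builtin_demo
```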
Elementary examples and uses
length($0)>72 -- print all of the lines
whose length exceeds 72 bytes
{$2="";print} -- remove the second field
from each line
{print $2} -- print only the second field
of each line
/Ucast/{print $1 "=" $NF} -- for each line
which contains the string 'Ucast' print the first field, an
equal sign and the last field (awk code to create awk code; a
common trick)
BEGIN{FS="/"};NF<4 -- using '/' as a
field separator, print only those records with less than four
fields; when applied to the output of du, gives a two level
summary
{n++;t+=$4};END{print n " " t} -- when
applied to the output of an ls -l command provides a count and
total size of the listed files; I use it as part of an alias for
dir. Depending on your flavor of UNIX, the $4 may need to be
changed to $5.
$0==prv{ct++;next}{if(NR>1)printf("%8d %s\n",ct,prv);ct=1;prv=$0}
END{if(NR)printf("%8d %s\n",ct,prv)} -- prints each unique record with a
count of the number of occurrences of it; presumes input is sorted
Advanced awk programming
Program structure (if, for, while, etc.)
if(boolean) statement1 else statement2 if the boolean
expression evaluates to true execute statement1, otherwise
execute statement2
for(v=init;boolean;v change) statement Standard C for
loop, assigns v the value of init then while the
boolean expression is true executes the statement then the
v change
for(v in array) statement Assigns to v
each of the values of the subscripts of array, not in any
particular order, then executes statement
while(boolean) statement While the boolean
expression is true, execute the statement
do statement while(boolean) execute
statement, evaluate the boolean expression and if true,
repeat
statement in any of the above constructs may
be either a simple statement or a series of statements enclosed in
{}, again like C; a further requirement is that the opening
{ must be on the line with the beginning keyword (if,
for, while, do) either physically or logically
via \ .
break -- exit from an enclosing for or while
loop
continue -- restart the enclosing for or
while loop from the top
next -- stop processing the current record,
read the next record and begin processing with the first
command
exit -- terminate all input processing and,
if present, execute the END command
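The constructs above can be sketched in a few lines. The loop bounds and variable names are arbitrary; the second awk command shows exit still running the END command.

```shell
# Hypothetical example: for, if, continue, break, while, do and exit.
flow_demo() {
    awk 'BEGIN {
        for (i = 1; i <= 10; i++) {
            if (i % 2) continue       # skip odd numbers
            if (i > 6) break          # leave the loop after 6
            out = out i ","
        }
        print out
        n = 3
        while (n > 0) n--             # counts down to 0
        do { n++ } while (n < 2)      # body runs before the test
        print n
    }'
    printf '1\n2\n3\n' | awk '$1 == 2 { exit } { print } END { print "done" }'
}
flow_demo
```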
Arrays
There are two types of arrays in awk - standard and
generalized
Standard arrays take the usual integer subscripts, starting at
0 and going up; multidimensional arrays are allowed and behave as
expected
Generalized arrays take any type of variable(s) as subscripts,
but the subscript(s) are treated as one long string
expression.
The use of for(a in x) on a generalized array will
return all of the valid subscripts in some order, not necessarily
the one you wished.
The subscript separator is called SUBSEP and has a default
value of "\034" (a control character unlikely to appear in data),
so a[i,j] is stored as a[i SUBSEP j]
Elements can be deleted from an array via the
delete array[subscript] statement
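A sketch of generalized arrays in use. The array names sum, grid and part and the sample data are invented; the multi-part subscript is taken apart again with split and SUBSEP.

```shell
# Hypothetical example: string subscripts, a two-part subscript, for-in and delete.
array_demo() {
    printf 'x 1\ny 2\nx 3\n' | awk '
        { sum[$1] += $2 }                      # subscript is the first field
        END {
            grid["a", "b"] = 1                 # stored under "a" SUBSEP "b"
            for (k in grid) { split(k, part, SUBSEP); print part[1], part[2] }
            delete sum["y"]
            for (k in sum) print k, sum[k]     # only x remains
        }'
}
array_demo
```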
Built-in variables
FILENAME -- The name of the file currently
being processed
OFS -- Output Field Separator default ' ' (a single blank)
RS -- Input Record Separator default \n
ORS -- Output Record Separator default
\n
FNR -- Current line's number with respect to
the current file
OFMT -- Output format for printed numbers
default %.6g
RSTART -- The location of the data matched
using the match built-in function
RLENGTH -- The length of the data matched using the
match built-in function
Built-in functions
gsub(re,sub,str) -- replace, in str,
each occurrence of the regular expression re with
sub; return the number of substitutions performed
int(expr) -- return the value of expr
with all fractional parts removed
match(str,re) -- return the location in
str where the regular expression re occurs and set
RSTART and RLENGTH; if re is not found return
0
split(str,arrname,sep) -- split str
into pieces using sep as the separator and assign the pieces
in order to the elements from 1 up of arrname; use FS if
sep is not given
sprintf(format,value,value,...) -- write the
values, as the format indicates, into a
string and return that string
sub(re,sub,str) -- replace, in str,
the first occurrence of the regular expression re with
sub; return 1 if successful, 0 otherwise
system(command) -- pass command to the
local operating system to execute and return the exit status code
returned by the operating system
tolower(str) -- return a string similar to
str with all capital letters changed to lower case
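Most of these functions can be tried on one made-up log line, as in the sketch below (system is left out because its result depends on the local machine).

```shell
# Hypothetical example: gsub, match, sub, split, sprintf, int and tolower.
string_demo() {
    awk 'BEGIN {
        s = "2024-01-15 ERROR disk full"
        n = gsub(/-/, "/", s); print n, s          # count of replacements, new string
        if (match(s, /[A-Z]+/)) print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
        sub(/ERROR/, "warn", s); print s           # first occurrence only
        cnt = split("a:b:c", parts, ":"); print cnt, parts[1], parts[3]
        print sprintf("%03d", int(7.9))
        print tolower("MiXeD")
    }'
}
string_demo
```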
Other file I/O
print and printf may have
> (or >>)
filename or | command appended
and the output will be sent to the named file or command; once a
file is opened, it remains open until explicitly closed
getline var < filename will read the next
line from filename into var. Again, once a file is
opened, it remains so until it is explicitly closed
close(filename) explicitly closes the file
named by the filename expression
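A sketch of writing a file and reading it back within one awk program. The temporary file comes from mktemp and its name is passed in through the invented variable out; this assumes an awk that supports -v.

```shell
# Hypothetical example: print > file, close, and getline < file.
io_demo() {
    tmp=$(mktemp)
    awk -v out="$tmp" 'BEGIN {
        print "first"  > out        # opens (and truncates) the file
        print "second" > out        # file is still open, so this adds a line
        close(out)                  # close it before reading it back
        while ((getline line < out) > 0) n++
        close(out)
        print n, line               # lines read, and the last one
    }'
    rm -f "$tmp"
}
io_demo
```

Without the first close, the read could start before the written lines have reached the file.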
Writing your own functions
A function begins with a function header of the form:
function name(argument(s), localvar(s)) {
and ends with the matching }
The value of the function is returned via a statement of the
form:
return value
Functions do not have to return a value and the value returned
by a function (either built-in or written locally) may be ignored
by just placing the function with its arguments as a whole,
separate statement
The local variables indicated in the localvars of the heading
replace the global variables of the same name until the function
completes, at which time the globals are restored
Functions may have side effects such as updating global
variables, doing I/O or running other functions with side effects;
beware the frumious bandersnatch
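The points above can be sketched with one small function. The names sum_to and calls are invented; the extra spaces in the header are only a convention for separating the arguments from the local variables.

```shell
# Hypothetical example: a function with locals, a return value and a side effect.
func_demo() {
    awk '
        # i and total are locals: extra parameters the caller never passes
        function sum_to(n,    i, total) {
            for (i = 1; i <= n; i++) total += i
            calls++                   # side effect on a global
            return total
        }
        BEGIN {
            i = 99
            print sum_to(4)
            sum_to(2)                 # return value ignored
            print i, calls            # the global i is untouched
        }'
}
func_demo
```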
Take a set of time stamped data and convert the data from
absolute time and counts to relative time and average counts. The
data is presumed to be all amenable to treatment as integers. If
not, formats better than %d must be used.
Write a pair of set lines to a file called plots. For each input
line, if a file whose name is the first field on the line with a .r
appended exists, write a command to the stdout file containing the
file name and the second field from the line; also write a plot
statement to a file called plots using the third field from the
input line. After the file has been processed, add a gnuplot
command to the stdout file. If all of the output is passed to sh or
csh through a pipe, the commands will be executed.
Make lines whose first characters are 'A', 'B', or 'C' have
lengths of 25, 20, and 50 bytes respectively, changing no other
lines.
/^\+/ { hold = hold "\r" substr($0,2); next }
      { if( unfirst ) print hold
        hold = ""
      }
/^1/  { hold = "\f" }
/^0/  { hold = "\n" }
/^-/  { hold = "\n\n" }
      { unfirst = 1
        hold = hold substr($0,2)
      }
END   { if(unfirst) print hold }
This routine will take FORTRAN-type output with leading ANSI
vertical motion indicators and convert it to a stream with ASCII
printer control sequences in it.
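BEGIN { b=""; if(ll==0) ll=72 }
NF==0 { print b; b=""; print ""; next }
      { if(substr(b,length(b),1)=="-")
          b=substr(b,1,length(b)-1) $0
        else if(b=="") b=$0
        else b=b " " $0
        while(length(b)>ll) {
          i = ll
          while(i>0 && substr(b,i,1)!=" ") i--   # back up to a blank
          if(i==0) i = index(b," ")              # none: break after the long word
          if(i==0) break                         # no blank at all yet
          print substr(b,1,i-1)
          b = substr(b,i+1)
        }
      }
END   { print b; print "" }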
This will take an arbitrary stream of text (where paragraphs are
indicated by consecutive \n) and make all the lines approximately
the same length. The default output line length is 72, but it may
be set via a parameter on the awk command line. Both long and short
lines are taken care of but extra spaces/tabs within the text are
not correctly handled.
BEGIN { FS = "\t" # make tab the field separator
printf("%10s %6s %5s %s\n\n",
"COUNTRY", "AREA", "POP", "CONTINENT")
}
{ printf("%10s %6d %5d %s\n", $1, $2, $3, $4)
area = area +$2
pop = pop + $3
}
END { printf("\n%10s %6d %5d\n", "TOTAL", area, pop) }
This will take a variable width table of data with four tab
separated fields and print it as a fixed length table with headings
and totals.
Important things which will bite you
$1 inside the awk script is not $1 of the shell script;
use variable assignment on the command line to move data from the
shell to the awk script.
Actions are within {}, not selections
Every selection is applied to each input line
after the previously selected actions have occurred; this
means that a previous action can cause unexpected selections or
selection misses.
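Two of these traps can be sketched briefly. The function names pick_field and order_matters and the variable col are invented for the illustration.

```shell
# Hypothetical example: shell $1 versus awk $1, and an action that changes a later selection.
pick_field() {
    # "$1" is the shell argument; inside the quotes, $col is the awk field numbered col
    awk '{ print $col }' col="$1" -
}
order_matters() {
    echo 'cat' | awk '
        /cat/ { sub(/cat/, "dog") }           # edits $0 in place
        /dog/ { print "matched dog:", $0 }'   # now matches, though the input never said dog
}
echo 'a b c' | pick_field 3
order_matters
```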
Operators
" "                The blank is the concatenation operator
+ - * / %          All of the usual C arithmetic operators: add, subtract,
                   multiply, divide and mod
== != < <= > >=    All of the usual C relational operators: equal, not
                   equal, less than, less than or equal, greater than,
                   and greater than or equal
&& ||              The C boolean operators and and or
= += -= *= /= %=   The C assignment operators
~ !~               Matches and doesn't match
?:                 C conditional value operator
^                  Exponentiation
++ --              Variable increment/decrement
Note the absence of the C bit operators &, |, << and >>
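A few of the operators in action, as a sketch with arbitrary values; results are stored in variables before printing so that > is never mistaken for output redirection.

```shell
# Hypothetical example: concatenation, arithmetic, matching, ?: and assignment operators.
ops_demo() {
    awk 'BEGIN {
        x = 7; y = 2
        print x "" y                          # concatenation of 7 and 2
        print x % y, x ^ y                    # remainder and power
        m = ("abc" ~ /^a/); nm = ("abc" !~ /^a/)
        print m, nm                           # 1 for true, 0 for false
        r = (x > y) ? "bigger" : "smaller"
        print r
        x += 3; x++; print x
    }'
}
ops_demo
```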
[s]printf format items
Format strings in the printf statement and sprintf function
consist of three different type of items: literal characters,
escaped literal characters and format items. Literal characters are
just that: characters which will print as themselves. Escaped
literal characters begin with a backslash (\) and are used to
represent control characters; the common ones are: \n for new line,
\t for tab and \r for return. Format items are used to describe how
program variables are to be printed.
All format items begin with a percent sign (%). The next part is
an optional length and precision field. The length is an integer
indicating the minimum field width of the item, negative if the
data is to be placed at the left of the field, with the white space
padding on its right. If the length field
begins with a zero (0), then instead of padding the value with
leading blanks, the item will be padded with leading 0s. The
precision is a decimal point followed by the number of decimal digits to
be displayed for various floating point representations. Next is an
optional source field size modifier, usually 'l' (ell). The last
item is the actual source data type, commonly one of the list
below:
d Integer
f Floating point in fixed point format
e Floating point in exponential format
g Floating point in "best fit" format; integer, fixed
point, or exponential; depending on exact value
s Character string
c Integer to be interpreted as a character
x Integer to be printed as hexadecimal
Examples:
%-20s Print a string in the left portion of a 20 character
field
%d Print an integer in however many spaces it takes
%6d Print an integer in at least 6 spaces; used to format
pretty output
%9ld Print a long integer in at least 9 spaces
%09ld Print a long integer in at least 9 spaces with leading
0s, not blanks
%.6f Print a float with 6 digits after the decimal and as
many before it as needed
%10.6f Print a float in a 10 space field with 6 digits after
the decimal
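The formats can be checked by printing values between brackets so the padding is visible. This sketch uses arbitrary values and leaves out the 'l' modifier, since not every awk is guaranteed to accept it.

```shell
# Hypothetical example: common format items, bracketed to show field widths.
fmt_demo() {
    awk 'BEGIN {
        printf("[%-20s]\n", "left")        # left portion of a 20 character field
        printf("[%d]\n", 42)
        printf("[%6d]\n", 42)              # at least 6 spaces
        printf("[%06d]\n", 42)             # leading 0s, not blanks
        printf("[%.6f]\n", 3.14159265)
        printf("[%10.6f]\n", 3.14159265)   # 10 space field, 6 digits after the decimal
        printf("[%x] [%c] [%e]\n", 255, 65, 1234.5)
    }'
}
fmt_demo
```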