Specifying how Fields are Separated
===================================
(This section is rather long; it describes one of the most
fundamental operations in `awk'. If you are a novice with `awk', we
recommend that you re-read this section after you have studied the
section on regular expressions; see Regular Expressions as Patterns: Regexp.)
The way `awk' splits an input record into fields is controlled by
the "field separator", which is a single character or a regular
expression. `awk' scans the input record for matches for the
separator; the fields themselves are the text between the matches. For
example, if the field separator is `oo', then the following line:
moo goo gai pan
would be split into three fields: `m', ` g' and ` gai pan'.
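The split described above can be checked directly. The following is a minimal sketch; the brackets around each field and the shell variable name `fields' are illustrative additions, used only to make the leading spaces visible:

```shell
# Split the sample line on the two-character separator "oo" and
# print every field inside brackets so leading spaces show up.
fields=$(echo 'moo goo gai pan' |
  awk 'BEGIN { FS = "oo" } { for (i = 1; i <= NF; i++) printf "[%s]", $i; print "" }')
echo "$fields"    # prints [m][ g][ gai pan]
```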
The field separator is represented by the built-in variable `FS'.
Shell programmers take note! `awk' does not use the name `IFS', which
is used by the shell.
You can change the value of `FS' in the `awk' program with the
assignment operator, `=' (see Assignment Expressions: Assignment Ops.). Often the right time to do this is at the beginning of
execution, before any input has been processed, so that the very first
record will be read with the proper separator. To do this, use the
special `BEGIN' pattern (see `BEGIN' and `END' Special Patterns: BEGIN/END.). For example, here we set the value of `FS' to the string
`","':
awk 'BEGIN { FS = "," } ; { print $2 }'
Given the input line,
John Q. Smith, 29 Oak St., Walamazoo, MI 42139
this `awk' program extracts the string ` 29 Oak St.'.
Sometimes your input data will contain separator characters that
don't separate fields the way you thought they would. For instance, the
person's name in the example we've been using might have a title or
suffix attached, such as `John Q. Smith, LXIX'. From input containing
such a name:
John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139
the previous sample program would extract ` LXIX', instead of ` 29 Oak
St.'. If you were expecting the program to print the address, you
would be surprised. So choose your data layout and separator
characters carefully to prevent such problems.
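One possible way to notice such surprises early is to check `NF' before trusting a field position. This is only a sketch: the expected count of four fields is an assumption about this particular sample layout, and the names `check', `plain' and `suffixed' are illustrative.

```shell
# Assumption: a well-formed record has exactly four comma-separated fields.
check='BEGIN { FS = "," }
NF != 4 { print "unexpected field count " NF ": " $0; next }
{ print $2 }'

plain=$(echo 'John Q. Smith, 29 Oak St., Walamazoo, MI 42139' | awk "$check")
suffixed=$(echo 'John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139' | awk "$check")

echo "$plain"       # the address field, with its leading space
echo "$suffixed"    # a complaint that the record has 5 fields
```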
As you know, by default, fields are separated by whitespace sequences
(spaces and tabs), not by single spaces: two spaces in a row do not
delimit an empty field. The default value of the field separator is a
string `" "' containing a single space. If this value were interpreted
in the usual way, each space character would separate fields, so two
spaces in a row would make an empty field between them. The reason
this does not happen is that a single space as the value of `FS' is a
special case: it is taken to specify the default manner of delimiting
fields.
If `FS' is any other single character, such as `","', then each
occurrence of that character separates two fields. Two consecutive
occurrences delimit an empty field. If the character occurs at the
beginning or the end of the line, that too delimits an empty field. The
space character is the only single character which does not follow these
rules.
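As a small illustration of these rules, the sample record below (invented for this sketch) has a leading comma, two commas in a row, and a trailing comma, so it contains three empty fields in addition to `a' and `b':

```shell
# Fields are: "", "a", "", "b", "" (five in total).
n=$(echo ',a,,b,' | awk 'BEGIN { FS = "," } { print NF }')
echo "$n"    # prints 5
```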
More generally, the value of `FS' may be a string containing any
regular expression. Then each match in the record for the regular
expression separates fields. For example, the assignment:
FS = ", \t"
makes every part of an input line that consists of a comma followed by a
space and a tab into a field separator. (`\t' stands for a tab.)
For a less trivial example of a regular expression, suppose you want
single spaces to separate fields the way single commas were used above.
You can set `FS' to `"[ ]"'. This regular expression matches a single
space and nothing else.
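The difference from the default can be seen by counting fields. In this sketch the input (an invented sample) has two spaces between `a' and `b'; the variable names are illustrative:

```shell
# Default FS: the run of two spaces is one separator, so NF is 2.
default_nf=$(echo 'a  b' | awk '{ print NF }')

# FS = "[ ]": each space separates, so an empty field appears and NF is 3.
single_nf=$(echo 'a  b' | awk 'BEGIN { FS = "[ ]" } { print NF }')

echo "$default_nf $single_nf"    # prints 2 3
```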
`FS' can be set on the command line. You use the `-F' argument to
do so. For example:
awk -F, 'PROGRAM' INPUT-FILES
sets `FS' to be the `,' character. Notice that the argument uses a
capital `F'. Contrast this with `-f', which specifies a file
containing an `awk' program. Case is significant in command options:
the `-F' and `-f' options have nothing to do with each other. You can
use both options at the same time to set the `FS' variable *and* get an
`awk' program from a file.
The value used for the argument to `-F' is processed in exactly the
same way as assignments to the built-in variable `FS'. This means that
if the field separator contains special characters, they must be escaped
appropriately. For example, to use a `\' as the field separator, you
would have to type:
# same as FS = "\\"
awk -F\\\\ '...' files ...
Since `\' is used for quoting in the shell, `awk' will see `-F\\'.
Then `awk' processes the `\\' for escape characters (see Constant
Expressions: Constants.), finally yielding a single `\' to be used for
the field separator.
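To see the two rounds of escape processing at work, the sketch below feeds `awk' a record whose fields are separated by backslashes. The sample record and the variable name `second' are made up for illustration:

```shell
# printf '%s\n' prints its argument literally, backslashes included.
# The shell reduces \\\\ to \\, and awk reduces \\ to a single \.
second=$(printf '%s\n' 'a\b\c' | awk -F\\\\ '{ print $2 }')
echo "$second"    # prints b
```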
As a special case, in compatibility mode (see Invoking `awk': Command Line.), if the argument to `-F' is `t', then `FS' is set to the
tab character. (This is because if you type `-F\t', without the quotes,
at the shell, the `\' gets deleted, so `awk' figures that you really
want your fields to be separated with tabs, and not `t's. Use `-v
FS="t"' on the command line if you really do want to separate your
fields with `t's.)
For example, let's use an `awk' program file called `baud.awk' that
contains the pattern `/300/', and the action `print $1'. Here is the
program:
/300/ { print $1 }
Let's also set `FS' to be the `-' character, and run the program on
the file `BBS-list'. The following command prints a list of the names
of the bulletin boards that operate at 300 baud and the first three
digits of their phone numbers:
awk -F- -f baud.awk BBS-list
It produces this output:
aardvark 555
alpo
barfly 555
bites 555
camelot 555
core 555
fooey 555
foot 555
macfoo 555
sdace 555
sabafoo 555
Note the second line of output. If you check the original file, you
will see that the second line looked like this:
alpo-net 555-3412 2400/1200/300 A
The `-' as part of the system's name was used as the field
separator, instead of the `-' in the phone number that was originally
intended. This demonstrates why you have to be careful in choosing
your field and record separators.
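The problem can be reproduced with just the one line quoted above. The second command is one possible alternative, not the only one: it keeps the default separator and uses the `substr' function to take the first three digits of the phone number. The variable names are illustrative.

```shell
line='alpo-net 555-3412 2400/1200/300 A'

# Splitting on "-" cuts the system name short.
by_dash=$(echo "$line" | awk -F- '/300/ { print $1 }')

# Sketch of an alternative: split on whitespace, then trim the number.
by_space=$(echo "$line" | awk '/300/ { print $1, substr($2, 1, 3) }')

echo "$by_dash"     # prints alpo
echo "$by_space"    # prints alpo-net 555
```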
The following program searches the system password file, and prints
the entries for users who have no password:
awk -F: '$2 == ""' /etc/passwd
Here we use the `-F' option on the command line to set the field
separator. Note that fields in `/etc/passwd' are separated by colons.
The second field represents a user's encrypted password, but if the
field is empty, that user has no password.
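To try the same program without reading the real password file, you can feed it sample records. In this sketch the `guest' entry is hypothetical, invented only to provide a record with an empty second field:

```shell
# Two sample records in /etc/passwd layout; only "guest" has no password.
nopass=$(printf '%s\n' \
  'root:nSijPlPhZZwgE:0:0:Root:/:' \
  'guest::100:100:Guest:/tmp:' |
  awk -F: '$2 == ""')
echo "$nopass"    # prints guest::100:100:Guest:/tmp:
```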
According to the POSIX standard, `awk' is supposed to behave as if
each record is split into fields at the time that it is read. In
particular, this means that you can change the value of `FS' after a
record is read, but before any of the fields are referenced. The value
of the fields (i.e. how they were split) should reflect the old value
of `FS', not the new one.
However, many implementations of `awk' do not do this. Instead,
they defer splitting the fields until a field reference actually
happens, using the *current* value of `FS'! This behavior can be
difficult to diagnose. The following example illustrates the results of
the two methods. (The `sed' command prints just the first line of
`/etc/passwd'.)
sed 1q /etc/passwd | awk '{ FS = ":" ; print $1 }'
will usually print
root
on an incorrect implementation of `awk', while `gawk' will print
something like
root:nSijPlPhZZwgE:0:0:Root:/:
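A two-record input makes the timing easier to see. In this sketch (sample data invented for illustration), a POSIX-conforming `awk' such as `gawk' should print the whole first record, because it was already split with the old `FS', and then `c' for the second record. Setting `FS' in a `BEGIN' rule avoids the difference altogether:

```shell
# FS changes while the first record is being processed, so the new
# value is only guaranteed to apply from the second record onward.
out=$(printf 'a:b\nc:d\n' | awk '{ FS = ":" ; print $1 }')
echo "$out"

# Setting FS before any input is read behaves the same everywhere.
fixed=$(printf 'a:b\nc:d\n' | awk 'BEGIN { FS = ":" } { print $1 }')
echo "$fixed"
```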
There is an important difference between the two cases of `FS = " "'
(a single blank) and `FS = "[ \t]+"' (which is a regular expression
matching one or more blanks or tabs). For both values of `FS', fields
are separated by runs of blanks and/or tabs. However, when the value of
`FS' is `" "', `awk' will strip leading and trailing whitespace from
the record, and then decide where the fields are.
For example, the following command prints `b':
echo ' a b c d ' | awk '{ print $2 }'
However, the following prints `a':
echo ' a b c d ' | awk 'BEGIN { FS = "[ \t]+" } ; { print $2 }'
In this case, the first field is null.
The stripping of leading and trailing whitespace also comes into
play whenever `$0' is recomputed. For instance, this pipeline
echo ' a b c d' | awk '{ print; $2 = $2; print }'
produces this output (the first line keeps its leading space):
a b c d
a b c d
The first `print' statement prints the record as it was read, with
leading whitespace intact. The assignment to `$2' rebuilds `$0' by
concatenating `$1' through `$NF' together, separated by the value of
`OFS'. Since the leading whitespace was ignored when finding `$1', it
is not part of the new `$0'. Finally, the last `print' statement
prints the new `$0'.
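The role of `OFS' in the rebuilt record is easier to see with a separator other than the default space. In this sketch the choice of `-' for `OFS' and the variable name `rebuilt' are purely illustrative:

```shell
# Assigning to $2 forces $0 to be rebuilt from $1..$NF joined by OFS.
rebuilt=$(echo ' a b c d' | awk 'BEGIN { OFS = "-" } { $2 = $2; print }')
echo "$rebuilt"    # prints a-b-c-d
```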
The following table summarizes how fields are split, based on the
value of `FS'.
`FS == " "'
     Fields are separated by runs of whitespace. Leading and trailing
     whitespace are ignored. This is the default.
`FS == ANY OTHER SINGLE CHARACTER'
     Fields are separated by each occurrence of the character. Multiple
     successive occurrences delimit empty fields, as do leading and
     trailing occurrences.
`FS == REGEXP'
     Fields are separated by occurrences of characters that match
     REGEXP. Leading and trailing matches of REGEXP delimit empty
     fields.