Using Registers
---------------
A group in a regular expression can match a (possibly empty)
substring of the string that the regular expression as a whole matched.
The matcher remembers the beginning and end of the substring matched by
each group.
To find out what they matched, pass a nonzero REGS argument to a GNU
matching or searching function (see GNU Matching and GNU Searching),
i.e., the address of a structure of this type, as defined in
`regex.h':
     struct re_registers
     {
       unsigned num_regs;
       regoff_t *start;
       regoff_t *end;
     };
Except for (possibly) the NUM_REGS'th element (see below), the Ith
element of the `start' and `end' arrays records information about the
Ith group in the pattern. (They're declared as C pointers, but this is
only because not all C compilers accept zero-length arrays;
conceptually, it is simplest to think of them as arrays.)
The `start' and `end' arrays are allocated in various ways, depending
on the value of the `regs_allocated' field in the pattern buffer passed
to the matcher.
The simplest and perhaps most useful way is to let the matcher
(re)allocate enough space to record information for all the groups in
the regular expression. If `regs_allocated' is `REGS_UNALLOCATED', the
matcher allocates 1 + RE_NSUB elements (RE_NSUB is another field in the
pattern buffer; see GNU Pattern Buffers), sets the extra element to -1,
and sets `regs_allocated' to `REGS_REALLOCATE'. Then on subsequent calls
with the same pattern buffer and REGS arguments, the matcher
reallocates more space if necessary.
It would perhaps be more logical to make the `regs_allocated' field
part of the `re_registers' structure, instead of part of the pattern
buffer. But in that case the caller would be forced to initialize the
structure before passing it. Much existing code doesn't do this
initialization, and it's arguably better to avoid it anyway.
`re_compile_pattern' sets `regs_allocated' to `REGS_UNALLOCATED', so
if you use the GNU regular expression functions, you get this behavior
by default.
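As a rough illustration of this default behavior, the following sketch
compiles a pattern, lets the matcher allocate the register arrays, and
reads back the span of one group. It assumes the GNU C Library's
`regex.h' with `_GNU_SOURCE' defined, and it selects POSIX extended
syntax so that `(' and `)' are the grouping operators. The helper name
`group_span' is hypothetical, and releasing the arrays with `free' is an
assumption about how the matcher allocated them.

```c
#define _GNU_SOURCE            /* expose the GNU re_* interface in glibc */
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: match PATTERN at the start of STRING and store
   the span of group GROUP in *START and *END.  Returns the length of
   the match, or a negative value if compiling or matching failed.  */
static int
group_span (const char *pattern, const char *string, unsigned group,
            regoff_t *start, regoff_t *end)
{
  struct re_pattern_buffer buf;
  struct re_registers regs;
  regoff_t matched;

  memset (&buf, 0, sizeof buf);    /* no buffer, fastmap or translate table */
  memset (&regs, 0, sizeof regs);  /* nothing allocated yet */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);

  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -3;                     /* compilation failed */

  /* regs_allocated is REGS_UNALLOCATED here, so the matcher allocates
     regs.start and regs.end itself.  */
  matched = re_match (&buf, string, (regoff_t) strlen (string), 0, &regs);
  if (matched >= 0)
    {
      assert (group < regs.num_regs);
      *start = regs.start[group];
      *end = regs.end[group];
    }

  free (regs.start);               /* assumed to come from malloc */
  free (regs.end);
  regfree (&buf);
  return (int) matched;
}
```

A caller would pass the group number it is interested in, with 0
standing for the entire match.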
(The `re_set_registers' function is not yet documented here.)
POSIX, on the other hand, requires a different interface: the caller
is supposed to pass in a fixed-length array which the matcher fills.
Therefore, if `regs_allocated' is `REGS_FIXED', the matcher simply fills
that array.
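The fixed-array case might look like the following sketch. It assumes
the GNU C Library's `regex.h' with `_GNU_SOURCE' defined, and it assumes
that the caller may set `regs_allocated' and the `re_registers' fields
directly after compiling. The names `match_fixed' and `MAX_GROUPS' are
hypothetical.

```c
#define _GNU_SOURCE            /* expose the GNU re_* interface in glibc */
#include <assert.h>
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 4           /* hypothetical capacity chosen by the caller */

/* Hypothetical helper: match PATTERN at the start of STRING, recording
   group spans in the caller's fixed-length arrays.  Returns the length
   of the match, or a negative value on failure.  */
static int
match_fixed (const char *pattern, const char *string,
             regoff_t starts[MAX_GROUPS], regoff_t ends[MAX_GROUPS])
{
  struct re_pattern_buffer buf;
  struct re_registers regs;
  regoff_t matched;

  assert (starts != NULL && ends != NULL);
  memset (&buf, 0, sizeof buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -3;                     /* compilation failed */

  /* Describe the caller-owned arrays and tell the matcher not to
     allocate or resize them (an assumption of this sketch).  */
  regs.num_regs = MAX_GROUPS;
  regs.start = starts;
  regs.end = ends;
  buf.regs_allocated = REGS_FIXED;

  matched = re_match (&buf, string, (regoff_t) strlen (string), 0, &regs);
  regfree (&buf);                  /* the arrays stay with the caller */
  return (int) matched;
}
```

Because the arrays belong to the caller, nothing needs to be freed for
the registers afterwards; the trade-off is that groups beyond the chosen
capacity cannot be reported.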
The following examples illustrate the information recorded in the
`re_registers' structure. (In all of them, `(' represents the
open-group and `)' the close-group operator. The first character in
the string STRING is at index 0.)
* If the regular expression has an I-th group not contained within
another group that matches a substring of STRING, then the
function sets `REGS->start[I]' to the index in STRING where the
substring matched by the I-th group begins, and `REGS->end[I]' to
the index just beyond that substring's end. The function sets
`REGS->start[0]' and `REGS->end[0]' to analogous information about
the entire pattern.
For example, when you match `((a)(b))' against `ab', you get:
     * 0 in `REGS->start[0]' and 2 in `REGS->end[0]'
     * 0 in `REGS->start[1]' and 2 in `REGS->end[1]'
     * 0 in `REGS->start[2]' and 1 in `REGS->end[2]'
     * 1 in `REGS->start[3]' and 2 in `REGS->end[3]'
* If a group matches more than once (as it might if followed by,
e.g., a repetition operator), then the function reports the
information about what the group *last* matched.
For example, when you match the pattern `(a)*' against the string
`aa', you get:
     * 0 in `REGS->start[0]' and 2 in `REGS->end[0]'
     * 1 in `REGS->start[1]' and 2 in `REGS->end[1]'
* If the I-th group does not participate in a successful match,
e.g., it is an alternative not taken or a repetition operator
allows zero repetitions of it, then the function sets
`REGS->start[I]' and `REGS->end[I]' to -1.
For example, when you match the pattern `(a)*b' against the string
`b', you get:
     * 0 in `REGS->start[0]' and 1 in `REGS->end[0]'
     * -1 in `REGS->start[1]' and -1 in `REGS->end[1]'
* If the I-th group matches a zero-length string, then the function
sets `REGS->start[I]' and `REGS->end[I]' to the index just beyond
that zero-length string.
For example, when you match the pattern `(a*)b' against the string
`b', you get:
     * 0 in `REGS->start[0]' and 1 in `REGS->end[0]'
     * 0 in `REGS->start[1]' and 0 in `REGS->end[1]'
* If an I-th group contains a J-th group in turn not contained
within any other group within group I and the function reports a
match of the I-th group, then it records in `REGS->start[J]' and
`REGS->end[J]' the last match (if it matched) of the J-th group.
For example, when you match the pattern `((a*)b)*' against the
string `abb', group 2 last matches the empty string (at index 2), so
that empty match is what you get:
     * 0 in `REGS->start[0]' and 3 in `REGS->end[0]'
     * 2 in `REGS->start[1]' and 3 in `REGS->end[1]'
     * 2 in `REGS->start[2]' and 2 in `REGS->end[2]'
When you match the pattern `((a)*b)*' against the string `abb',
group 2 doesn't participate in the last match, so you get:
     * 0 in `REGS->start[0]' and 3 in `REGS->end[0]'
     * 2 in `REGS->start[1]' and 3 in `REGS->end[1]'
     * 0 in `REGS->start[2]' and 1 in `REGS->end[2]'
* If an I-th group contains a J-th group in turn not contained
within any other group within group I and the function sets
`REGS->start[I]' and `REGS->end[I]' to -1, then it also sets
`REGS->start[J]' and `REGS->end[J]' to -1.
For example, when you match the pattern `((a)*b)*c' against the
string `c', you get:
     * 0 in `REGS->start[0]' and 1 in `REGS->end[0]'
     * -1 in `REGS->start[1]' and -1 in `REGS->end[1]'
     * -1 in `REGS->start[2]' and -1 in `REGS->end[2]'
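To compare the examples above with what a particular library reports,
a small helper along these lines may be useful. It is only a sketch: it
assumes the GNU C Library's `regex.h' with `_GNU_SOURCE' defined and
POSIX extended syntax, and the name `describe_registers' is
hypothetical. It writes one `start,end' pair per register, beginning
with register 0, and hard-codes no expected values, since results for
the nested-repetition cases may depend on the implementation.

```c
#define _GNU_SOURCE            /* expose the GNU re_* interface in glibc */
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: match PATTERN at the start of STRING and write
   "start,end" for register 0 and for each group into OUT, separated by
   spaces.  Returns the length of the match, or a negative value if
   compiling or matching failed (OUT is then left empty).  */
static int
describe_registers (const char *pattern, const char *string,
                    char *out, size_t out_size)
{
  struct re_pattern_buffer buf;
  struct re_registers regs;
  regoff_t matched;
  size_t used = 0, i;

  assert (out != NULL && out_size > 0);
  out[0] = '\0';
  memset (&buf, 0, sizeof buf);
  memset (&regs, 0, sizeof regs);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -3;                     /* compilation failed */

  matched = re_match (&buf, string, (regoff_t) strlen (string), 0, &regs);
  if (matched >= 0)
    /* Registers 0 .. re_nsub cover the whole match and every group.  */
    for (i = 0; i <= buf.re_nsub && used < out_size; i++)
      used += snprintf (out + used, out_size - used, "%s%d,%d",
                        i ? " " : "",
                        (int) regs.start[i], (int) regs.end[i]);

  free (regs.start);               /* assumed to come from malloc */
  free (regs.end);
  regfree (&buf);
  return (int) matched;
}
```

For instance, calling it with the pattern `((a)(b))' and the string `ab'
should produce four pairs corresponding to the first example above.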