Index of Section 3 Manual Pages
| Interix / SUA | pcreposix.3 | Interix / SUA |
PCRE(3) PCRE(3)
NAME
PCRE - Perl-compatible regular expressions.
SYNOPSIS OF POSIX API
#include
int regcomp(regex_t *preg, const char *pattern,
int cflags);
int regexec(regex_t *preg, const char *string,
size_t nmatch, regmatch_t pmatch[], int eflags);
size_t regerror(int errcode, const regex_t *preg,
char *errbuf, size_t errbuf_size);
void regfree(regex_t *preg);
DESCRIPTION
This set of functions provides a POSIX-style API to the
PCRE regular expression package. See the pcreapi documen-
tation for a description of the native API, which contains
additional functionality.
The functions described here are just wrapper functions
that ultimately call the PCRE native API. Their prototypes
are defined in the pcreposix.h header file, and on Unix
systems the library itself is called pcreposix.a, so can
be accessed by adding -lpcreposix to the command for link-
ing an application which uses them. Because the POSIX
functions call the native ones, it is also necessary to
add -lpcre.
I have implemented only those option bits that can be rea-
sonably mapped to PCRE native options. In addition, the
options REG_EXTENDED and REG_NOSUB are defined with the
value zero. They have no effect, but since programs that
are written to the POSIX interface often use them, this
makes it easier to slot in PCRE as a replacement library.
Other POSIX options are not even defined.
When PCRE is called via these functions, it is only the
API that is POSIX-like in style. The syntax and semantics
of the regular expressions themselves are still those of
Perl, subject to the setting of various PCRE options, as
described below. "POSIX-like in style" means that the API
approximates to the POSIX definition; it is not fully
POSIX-compatible, and in multi-byte encoding domains it is
probably even less compatible.
The header for these functions is supplied as pcreposix.h
to avoid any potential clash with other POSIX libraries.
It can, of course, be renamed or aliased as regex.h, which
is the "correct" name. It provides two structure types,
regex_t for compiled internal forms, and regmatch_t for
returning captured substrings. It also defines some con-
stants whose names start with "REG_"; these are used for
setting options and identifying error codes.
COMPILING A PATTERN
The function regcomp() is called to compile a pattern into
an internal form. The pattern is a C string terminated by
a binary zero, and is passed in the argument pattern. The
preg argument is a pointer to a regex_t structure which is
used as a base for storing information about the compiled
expression.
The argument cflags is either zero, or contains one or
more of the bits defined by the following macros:
REG_ICASE
The PCRE_CASELESS option is set when the expression is
passed for compilation to the native function.
REG_NEWLINE
The PCRE_MULTILINE option is set when the expression is
passed for compilation to the native function. Note that
this does not mimic the defined POSIX behaviour for
REG_NEWLINE (see the following section).
In the absence of these flags, no options are passed to
the native function. This means the the regex is compiled
with PCRE default semantics. In particular, the way it
handles newline characters in the subject string is the
Perl way, not the POSIX way. Note that setting PCRE_MULTI-
LINE has only some of the effects specified for REG_NEW-
LINE. It does not affect the way newlines are matched by .
(they aren't) or by a negative class such as [^a] (they
are).
The yield of regcomp() is zero on success, and non-zero
otherwise. The preg structure is filled in on success, and
one member of the structure is public: re_nsub contains
the number of capturing subpatterns in the regular expres-
sion. Various error codes are defined in the header file.
MATCHING NEWLINE CHARACTERS
This area is not simple, because POSIX and Perl take dif-
ferent views of things. It is not possible to get PCRE to
obey POSIX semantics, but then PCRE was never intended to
be a POSIX engine. The following table lists the different
possibilities for matching newline characters in PCRE:
Default Change with
. matches newline no PCRE_DOTALL
newline matches [^a] yes not changeable
$ matches \n at end yes PCRE_DOLLARENDONLY
$ matches \n in middle no PCRE_MULTILINE
^ matches \n in middle no PCRE_MULTILINE
This is the equivalent table for POSIX:
Default Change with
. matches newline yes REG_NEWLINE
newline matches [^a] yes REG_NEWLINE
$ matches \n at end no REG_NEWLINE
$ matches \n in middle no REG_NEWLINE
^ matches \n in middle no REG_NEWLINE
PCRE's behaviour is the same as Perl's, except that there
is no equivalent for PCRE_DOLLARENDONLY in Perl. In both
PCRE and Perl, there is no way to stop newline from match-
ing [^a].
The default POSIX newline handling can be obtained by set-
ting PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no
way to make PCRE behave exactly as for the REG_NEWLINE
action.
MATCHING A PATTERN
The function regexec() is called to match a pre-compiled
pattern preg against a given string, which is terminated
by a zero byte, subject to the options in eflags. These
can be:
REG_NOTBOL
The PCRE_NOTBOL option is set when calling the underlying
PCRE matching function.
REG_NOTEOL
The PCRE_NOTEOL option is set when calling the underlying
PCRE matching function.
The portion of the string that was matched, and also any
captured substrings, are returned via the pmatch argument,
which points to an array of nmatch structures of type reg-
match_t, containing the members rm_so and rm_eo. These
contain the offset to the first character of each sub-
string and the offset to the first character after the end
of each substring, respectively. The 0th element of the
vector relates to the entire portion of string that was
matched; subsequent elements relate to the capturing sub-
patterns of the regular expression. Unused entries in the
array have both structure members set to -1.
A successful match yields a zero return; various error
codes are defined in the header file, of which REG_NOMATCH
is the "expected" failure code.
ERROR MESSAGES
The regerror() function maps a non-zero errorcode from
either regcomp() or regexec() to a printable message. If
preg is not NULL, the error should have arisen from the
use of that structure. A message terminated by a binary
zero is placed in errbuf. The length of the message,
including the zero, is limited to errbuf_size. The yield
of the function is the size of buffer needed to hold the
whole message.
STORAGE
Compiling a regular expression causes memory to be allo-
cated and associated with the preg structure. The function
regfree() frees all such memory, after which preg may no
longer be used as a compiled expression.
AUTHOR
Philip Hazel
University Computing Service,
Cambridge CB2 3QG, England.
Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
PCRE(3)