PCRE(3)                                                   PCRE(3)



NAME
       PCRE - Perl-compatible regular expressions

SYNOPSIS OF PCRE API

       #include <pcre.h>

       pcre *pcre_compile(const char *pattern, int options,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);

       pcre_extra *pcre_study(const pcre *code, int options,
            const char **errptr);

       int pcre_exec(const pcre *code, const pcre_extra *extra,
            const char *subject, int length, int startoffset,
            int options, int *ovector, int ovecsize);

       int pcre_copy_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            char *buffer, int buffersize);

       int pcre_copy_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber, char *buffer,
            int buffersize);

       int pcre_get_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            const char **stringptr);

       int pcre_get_stringnumber(const pcre *code,
            const char *name);

       int pcre_get_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber,
            const char **stringptr);

       int pcre_get_substring_list(const char *subject,
            int *ovector, int stringcount,
            const char ***listptr);

       void pcre_free_substring(const char *stringptr);

       void pcre_free_substring_list(const char **stringptr);

       const unsigned char *pcre_maketables(void);

       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
            int what, void *where);

       int pcre_info(const pcre *code, int *optptr,
            int *firstcharptr);

       int pcre_config(int what, void *where);

       char *pcre_version(void);

       void *(*pcre_malloc)(size_t);

       void (*pcre_free)(void *);

       void *(*pcre_stack_malloc)(size_t);

       void (*pcre_stack_free)(void *);

       int (*pcre_callout)(pcre_callout_block *);


PCRE API

       PCRE  has  its  own native API, which is described in this
       document. There is also a set of  wrapper  functions  that
       correspond to the POSIX regular expression API.  These are
       described in the pcreposix documentation.

       The native API function  prototypes  are  defined  in  the
       header file pcre.h, and on Unix systems the library itself
       is called libpcre.a, so can be accessed by  adding  -lpcre
       to  the command for linking an application which calls it.
       The  header  file  defines  the  macros   PCRE_MAJOR   and
       PCRE_MINOR  to contain the major and minor release numbers
       for the library. Applications can  use  these  to  include
       support for different releases.

       The    functions    pcre_compile(),    pcre_study(),   and
       pcre_exec() are used for compiling  and  matching  regular
       expressions.  A  sample program that demonstrates the sim-
       plest way of using them is given in the  file  pcredemo.c.
       The pcresample documentation describes how to run it.

       There  are  convenience  functions for extracting captured
       substrings from a matched subject string. They are:

         pcre_copy_substring()
         pcre_copy_named_substring()
         pcre_get_substring()
         pcre_get_named_substring()
         pcre_get_substring_list()

       pcre_free_substring() and  pcre_free_substring_list()  are
       also  provided,  to  free  the  memory  used for extracted
       strings.

       The function pcre_maketables()  is  used  (optionally)  to
       build  a set of character tables in the current locale for
       passing to pcre_compile().

       The function pcre_fullinfo() is used to find out  informa-
       tion  about a compiled pattern; pcre_info() is an obsolete
       version which returns only some of the available  informa-
       tion,  but  is  retained for backwards compatibility.  The
       function pcre_version() returns a pointer to a string con-
       taining the version of PCRE and its date of release.

       The  global  variables pcre_malloc and pcre_free initially
       contain the entry points  of  the  standard  malloc()  and
       free()  functions respectively. PCRE calls the memory man-
       agement functions via these variables, so a  calling  pro-
       gram can replace them if it wishes to intercept the calls.
       This should be done before calling any PCRE functions.
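
       As a rough sketch of such an interception, the two functions
       below have the signatures that pcre_malloc and pcre_free
       expect and simply count allocations before delegating to
       malloc() and free(). The names counting_malloc,
       counting_free, live_blocks and total_bytes are hypothetical
       and are not part of PCRE.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bookkeeping, not part of PCRE. */
size_t live_blocks = 0;   /* blocks obtained but not yet freed */
size_t total_bytes = 0;   /* sum of all sizes ever requested */

/* Same signature as the function that pcre_malloc points to. */
void *counting_malloc(size_t size)
{
    void *block = malloc(size);
    if (block != NULL) {
        live_blocks++;
        total_bytes += size;
    }
    return block;
}

/* Same signature as the function that pcre_free points to. */
void counting_free(void *block)
{
    if (block != NULL) {
        assert(live_blocks > 0);   /* more frees than allocations is a bug */
        live_blocks--;
    }
    free(block);
}
```

       A program would install them with the assignments
       "pcre_malloc = counting_malloc;" and
       "pcre_free = counting_free;" before its first call to any
       PCRE function.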

       The global variables pcre_stack_malloc and pcre_stack_free
       are  also  indirections  to  memory  management functions.
       These special functions are used only when  PCRE  is  com-
       piled  to  use  the  heap for remembering data, instead of
       recursive function calls. This is a  non-standard  way  of
       building  PCRE,  for use in environments that have limited
       stacks. Because of the greater use of  memory  management,
       it  runs  more  slowly. Separate functions are provided so
       that special-purpose external code can be  used  for  this
       case.  When  used,  these functions are always called in a
       stack-like manner (last obtained, first freed), and always
       for memory blocks of the same size.
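
       The stack-like, same-size calling pattern described above
       makes a very simple block cache possible. The following is
       one possible sketch, not the allocator PCRE itself uses:
       freed blocks are kept on a linked list and handed back on
       the next request. The names pool_stack_malloc,
       pool_stack_free, cached_block, cache_head and
       cache_block_size are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical free-list node, stored inside each released block. */
struct cached_block { struct cached_block *next; };

struct cached_block *cache_head = NULL;  /* most recently freed block */
size_t cache_block_size = 0;             /* size fixed by the first request */

/* Same signature as the function that pcre_stack_malloc points to. */
void *pool_stack_malloc(size_t size)
{
    assert(size > 0);
    if (cache_block_size == 0)
        cache_block_size = size;
    /* Relies on the documented guarantee that all blocks have one size. */
    assert(size == cache_block_size);

    if (cache_head != NULL) {            /* reuse the last block freed */
        struct cached_block *block = cache_head;
        cache_head = block->next;
        return block;
    }
    /* Make sure a block is always large enough to hold a list node. */
    return malloc(size < sizeof(struct cached_block)
                  ? sizeof(struct cached_block) : size);
}

/* Same signature as the function that pcre_stack_free points to. */
void pool_stack_free(void *block)
{
    struct cached_block *node = block;
    if (node == NULL)
        return;
    node->next = cache_head;             /* push onto the free list */
    cache_head = node;
}
```

       This sketch never returns cached blocks to the system and
       is not thread-safe; a real replacement would need to
       address both points.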

       The  global variable pcre_callout initially contains NULL.
       It can be set by the caller to a "callout" function, which
       PCRE  will then call at specified points during a matching
       operation. Details are given in the pcrecallout documenta-
       tion.


MULTITHREADING

       The PCRE functions can be used in multi-threading applica-
       tions, with the proviso that the memory  management  func-
       tions    pointed    to    by    pcre_malloc,    pcre_free,
       pcre_stack_malloc, and pcre_stack_free,  and  the  callout
       function  pointed  to  by  pcre_callout, are shared by all
       threads.

       The compiled form of a regular expression is  not  altered
       during  matching,  so the same compiled pattern can safely
       be used by several threads at once.


CHECKING BUILD-TIME OPTIONS

       int pcre_config(int what, void *where);

       The function pcre_config() makes it possible  for  a  PCRE
       client  to discover which optional features have been com-
       piled into the PCRE library. The  pcrebuild  documentation
       has more details about these optional features.

       The first argument for pcre_config() is an integer, speci-
       fying which information is required; the  second  argument
       is  a  pointer to a variable into which the information is
       placed. The following information is available:

         PCRE_CONFIG_UTF8

       The output is an integer that is set to one if UTF-8  sup-
       port is available; otherwise it is set to zero.

         PCRE_CONFIG_NEWLINE

       The  output  is an integer that is set to the value of the
       code that is used for the newline character. It is  either
       linefeed (10) or carriage return (13), and should normally
       be the standard character for your operating system.

         PCRE_CONFIG_LINK_SIZE

       The output is an integer that contains the number of bytes
       used for internal linkage in compiled regular expressions.
       The value is 2, 3, or 4. Larger values allow larger  regu-
       lar  expressions  to be compiled, at the expense of slower
       matching. The default value of 2 is sufficient for all but
       the  most  massive  patterns, since it allows the compiled
       pattern to be up to 64K in size.
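
       The 64K figure follows from the number of distinct offsets
       that fit in two bytes. The sketch below applies the same
       arithmetic to the other link sizes; the results for 3 and
       4 are an extrapolation on that assumption, and the
       function name max_compiled_size is hypothetical.

```c
#include <assert.h>

/* Number of distinct offsets representable in link_size bytes.
   For link_size 2 this is the 64K limit mentioned in the text. */
unsigned long long max_compiled_size(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return 1ULL << (8 * link_size);
}
```

       With this arithmetic a link size of 2 gives 65536, 3 gives
       16777216, and 4 gives 4294967296; other limits in a given
       build may be lower.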

         PCRE_CONFIG_POSIX_MALLOC_THRESHOLD

       The output is an integer that contains the threshold above
       which  the  POSIX  interface uses malloc() for output vec-
       tors. Further details are given in the pcreposix  documen-
       tation.

         PCRE_CONFIG_MATCH_LIMIT

       The  output is an integer that gives the default limit for
       the number  of  internal  matching  function  calls  in  a
       pcre_exec()  execution.  Further  details  are  given with
       pcre_exec() below.

         PCRE_CONFIG_STACKRECURSE

       The output is an integer that is set to  one  if  internal
       recursion  is implemented by recursive function calls that
       use the stack to remember their state. This is  the  usual
       way  that PCRE is compiled. The output is zero if PCRE was
       compiled to use blocks of data  on  the  heap  instead  of
       recursive  function calls. In this case, pcre_stack_malloc
       and pcre_stack_free are called to manage memory blocks  on
       the heap, thus avoiding the use of the stack.


COMPILING A PATTERN

       pcre *pcre_compile(const char *pattern, int options,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);


       The function pcre_compile() is called to compile a pattern
       into an internal form. The pattern is a  C  string  termi-
       nated by a binary zero, and is passed in the argument pat-
       tern. A pointer to  a  single  block  of  memory  that  is
       obtained  via  pcre_malloc  is returned. This contains the
       compiled code and related data. The pcre type  is  defined
       for  the returned block; this is a typedef for a structure
       whose contents are not externally defined. It is up to the
       caller to free the memory when it is no longer required.

       Although the compiled code of a PCRE regex is relocatable,
       that is, it does not depend on memory location,  the  com-
       plete pcre data block is not fully relocatable, because it
       contains a copy of the  tableptr  argument,  which  is  an
       address (see below).

       The options argument contains independent bits that affect
       the compilation. It should  be  zero  if  no  options  are
       required.  Some  of the options, in particular, those that
       are compatible with Perl, can also be set and  unset  from
       within  the pattern (see the detailed description of regu-
       lar expressions in  the  pcrepattern  documentation).  For
       these options, the contents of the options argument speci-
       fies their initial settings at the  start  of  compilation
       and  execution. The PCRE_ANCHORED option can be set at the
       time of matching as well as at compile time.

       If errptr is NULL,  pcre_compile()  returns  NULL  immedi-
       ately.   Otherwise,  if  compilation  of  a pattern fails,
       pcre_compile() returns NULL, and sets the variable pointed
       to by errptr to point to a textual error message. The off-
       set from the start of the pattern to the  character  where
       the error was discovered is placed in the variable pointed
       to by erroffset, which must not be  NULL.  If  it  is,  an
       immediate error is given.

       If  the  final  argument,  tableptr,  is NULL, PCRE uses a
       default set of character tables which are built when it is
       compiled,  using the default C locale. Otherwise, tableptr
       must be the result of a call to pcre_maketables(). See the
       section on locale support below.

       This code fragment shows a typical straightforward call to
       pcre_compile():

         pcre *re;
         const char *error;
         int erroffset;
         re = pcre_compile(
           "^A.*Z",          /* the pattern */
           0,                /* default options */
           &error,           /* for error message */
           &erroffset,       /* for error offset */
           NULL);            /* use default character tables */
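
       When pcre_compile() returns NULL, the message and offset
       can be combined into a readable report. The helper below
       is a minimal sketch of one way to do that; it is not part
       of PCRE, and the name format_compile_error is
       hypothetical. It writes the message, the pattern, and a
       caret under the offending character.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats a three-line error report into buf.
   Returns what snprintf() returns. */
int format_compile_error(char *buf, size_t bufsize, const char *pattern,
                         const char *error, int erroffset)
{
    assert(buf != NULL && pattern != NULL && error != NULL);

    /* Defensive: keep the caret within (or just past) the pattern. */
    if (erroffset < 0 || (size_t)erroffset > strlen(pattern))
        erroffset = 0;

    /* "%*s" with an empty string prints erroffset spaces. */
    return snprintf(buf, bufsize, "compile failed: %s\n%s\n%*s^\n",
                    error, pattern, erroffset, "");
}
```

       A caller would test "re == NULL" after the call shown
       above, pass the pattern together with error and erroffset
       to the helper, and print the resulting buffer.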

       The following option bits are defined:

         PCRE_ANCHORED

       If  this  bit  is  set,  the  pattern  is  forced  to   be
       "anchored",  that  is,  it is constrained to match only at
       the first matching point in  the  string  which  is  being
       searched  (the  "subject string"). This effect can also be
       achieved by appropriate constructs in the pattern  itself,
       which is the only way to do it in Perl.

         PCRE_CASELESS

       If  this  bit  is  set,  letters in the pattern match both
       upper and lower case letters. It is equivalent  to  Perl's
       /i  option,  and  it  can be changed within a pattern by a
       (?i) option setting.

         PCRE_DOLLAR_ENDONLY

       If this bit is set, a dollar metacharacter in the  pattern
       matches  only  at  the  end of the subject string. Without
       this option, a dollar also matches immediately before  the
       final  character  if  it  is a newline (but not before any
       other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
       if  PCRE_MULTILINE  is set. There is no equivalent to this
       option in Perl, and no way to set it within a pattern.

         PCRE_DOTALL

       If this bit is set, a dot metacharacter in the pattern
       matches  all  characters,  including newlines. Without it,
       newlines are excluded. This option is equivalent to Perl's
       /s  option,  and  it  can be changed within a pattern by a
       (?s) option setting. A negative class such as [^a]  always
       matches a newline character, independent of the setting of
       this option.

         PCRE_EXTENDED

       If this bit is set, whitespace data characters in the pat-
       tern  are  totally ignored except when escaped or inside a
       character class. Whitespace does not include the VT  char-
       acter  (code  11).  In  addition,  characters  between  an
       unescaped # outside a character class and the next newline
       character, inclusive, are also ignored. This is equivalent
       to Perl's /x option, and it can be changed within  a  pat-
       tern by a (?x) option setting.

       This  option  makes it possible to include comments inside
       complicated patterns.  Note, however,  that  this  applies
       only  to  data characters. Whitespace characters may never
       appear within special character sequences  in  a  pattern,
       for  example  within  the  sequence (?( which introduces a
       conditional subpattern.

         PCRE_EXTRA

       This option was invented in order to  turn  on  additional
       functionality  of PCRE that is incompatible with Perl, but
       it is currently of very little use. When  set,  any  back-
       slash  in  a pattern that is followed by a letter that has
       no special meaning causes an error, thus  reserving  these
       combinations for future expansion. By default, as in Perl,
       a backslash followed by a letter with no  special  meaning
       is  treated  as  a  literal. There are at present no other
       features controlled by this option. It can also be set  by
       a (?X) option setting within a pattern.

         PCRE_MULTILINE

       By  default,  PCRE treats the subject string as consisting
       of a single "line" of characters (even if it actually con-
       tains several newlines). The "start of line" metacharacter
       (^) matches only at the start of  the  string,  while  the
       "end of line" metacharacter ($) matches only at the end of
       the  string,  or  before  a  terminating  newline  (unless
       PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.

       When PCRE_MULTILINE is set, the "start of line" and
       "end of line" constructs match  immediately  following  or
       immediately  before  any  newline  in  the subject string,
       respectively, as well as at the very start and  end.  This
       is  equivalent  to Perl's /m option, and it can be changed
       within a pattern by a (?m) option setting. If there are no
       "\n"  characters in a subject string, or no occurrences of
       ^ or $ in a pattern, setting PCRE_MULTILINE has no effect.

         PCRE_NO_AUTO_CAPTURE

       If  this  option  is  set, it disables the use of numbered
       capturing parentheses in the pattern. Any  opening  paren-
       thesis  that  is  not  followed by ? behaves as if it were
       followed by ?: but named parentheses can still be used for
       capturing  (and  they  acquire  numbers in the usual way).
       There is no equivalent of this option in Perl.

         PCRE_UNGREEDY

       This option inverts the "greediness" of the quantifiers so
       that  they are not greedy by default, but become greedy if
       followed by "?". It is not compatible with  Perl.  It  can
       also be set by a (?U) option setting within the pattern.

         PCRE_UTF8

       This option causes PCRE to regard both the pattern and the
       subject as strings of UTF-8 characters instead of  single-
       byte  character  strings. However, it is available only if
       PCRE has been built to include UTF-8 support. If not,  the
       use  of this option provokes an error. Details of how this
       option changes the behaviour of PCRE are given in the sec-
       tion on UTF-8 support in the main pcre page.

         PCRE_NO_UTF8_CHECK

       When  PCRE_UTF8  is  set, the validity of the pattern as a
       UTF-8 string is automatically checked. If an invalid UTF-8
       sequence  of  bytes  is  found,  pcre_compile() returns an
       error. If you already know that your pattern is valid, and
       you  want  to skip this check for performance reasons, you
       can set the PCRE_NO_UTF8_CHECK option. When it is set, the
       effect  of passing an invalid UTF-8 string as a pattern is
       undefined. It may cause your program to crash.  Note  that
       there  is a similar option for suppressing the checking of
       subject strings passed to pcre_exec().



STUDYING A PATTERN

       pcre_extra *pcre_study(const pcre *code, int options,
            const char **errptr);

       When a pattern is going to be used several  times,  it  is
       worth spending more time analyzing it in order to speed up
       the time taken for  matching.  The  function  pcre_study()
       takes a pointer to a compiled pattern as its first argument.
       If studying the pattern produces additional information
       that will help speed up matching, pcre_study()
       returns a pointer to a  pcre_extra  block,  in  which  the
       study_data field points to the results of the study.

       The value returned by pcre_study() can be passed
       directly to pcre_exec().  However,  the  pcre_extra  block
       also  contains  other fields that can be set by the caller
       before the block is passed; these are described below.  If
       studying  the  pattern  does  not  produce  any additional
       information, pcre_study() returns NULL.  In  that  circum-
       stance,  if  the calling program wants to pass some of the
       other fields to  pcre_exec(),  it  must  set  up  its  own
       pcre_extra block.

       The  second  argument contains option bits. At present, no
       options are defined for pcre_study(),  and  this  argument
       should always be zero.

       The  third  argument  for pcre_study() is a pointer for an
       error message. If studying succeeds (even if  no  data  is
       returned),  the variable it points to is set to NULL. Oth-
       erwise it points to a textual error  message.  You  should
       therefore  test  the  error pointer for NULL after calling
       pcre_study(), to be sure that it has run successfully.

       This is a typical call to pcre_study():

         pcre_extra *pe;
         pe = pcre_study(
           re,             /* result of pcre_compile() */
           0,              /* no options exist */
           &error);        /* set to NULL or points to a message */

       At  present,  studying  a  pattern is useful only for non-
       anchored patterns that do not have a single fixed starting
       character.  A  bitmap  of  possible starting characters is
       created.


LOCALE SUPPORT

       PCRE handles caseless  matching,  and  determines  whether
       characters  are letters, digits, or whatever, by reference
       to a set of tables.  When  running  in  UTF-8  mode,  this
       applies  only  to characters with codes less than 256. The
       library contains a default set of tables that  is  created
       in  the  default  C  locale when PCRE is compiled. This is
       used when the final argument of  pcre_compile()  is  NULL,
       and is sufficient for many applications.

       An  alternative  set  of tables can, however, be supplied.
       Such tables are built  by  calling  the  pcre_maketables()
       function,  which has no arguments, in the relevant locale.
       The result can then be passed to pcre_compile()  as  often
       as  necessary.  For  example, to build and use tables that
       are appropriate for  the  French  locale  (where  accented
       characters with codes greater than 128 are treated as let-
       ters), the following code could be used:

         setlocale(LC_CTYPE, "fr");
         tables = pcre_maketables();
         re = pcre_compile(..., tables);

       The tables are  built  in  memory  that  is  obtained  via
       pcre_malloc. The pointer that is passed to pcre_compile is
       saved with the compiled pattern, and the same  tables  are
       used  via  this  pointer  by pcre_study() and pcre_exec().
       Thus, for any single pattern,  compilation,  studying  and
       matching all happen in the same locale, but different pat-
       terns can be compiled in  different  locales.  It  is  the
       caller's responsibility to ensure that the memory contain-
       ing the tables remains available for  as  long  as  it  is
       needed.


INFORMATION ABOUT A PATTERN

       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
            int what, void *where);

       The pcre_fullinfo() function returns information  about  a
       compiled  pattern.  It  replaces  the obsolete pcre_info()
       function, which is  nevertheless  retained  for  backwards
       compatibility (and is documented below).

       The first argument for pcre_fullinfo() is a pointer to the
       compiled pattern. The second argument  is  the  result  of
       pcre_study(),  or NULL if the pattern was not studied. The
       third argument specifies which  piece  of  information  is
       required,  and the fourth argument is a pointer to a vari-
       able to receive the data. The yield  of  the  function  is
       zero  for  success,  or one of the following negative num-
       bers:

         PCRE_ERROR_NULL       the argument code was NULL
                               the argument where was NULL
         PCRE_ERROR_BADMAGIC   the "magic number" was not found
         PCRE_ERROR_BADOPTION  the value of what was invalid

       Here is a typical call of pcre_fullinfo(), to  obtain  the
       length of the compiled pattern:

         int rc;
         size_t length;
         rc = pcre_fullinfo(
           re,               /* result of pcre_compile() */
           pe,               /* result of pcre_study(), or NULL */
           PCRE_INFO_SIZE,   /* what is required */
           &length);         /* where to put the data */

       The possible values for the third argument are defined  in
       pcre.h, and are as follows:

         PCRE_INFO_BACKREFMAX

       Return  the  number  of  the highest back reference in the
       pattern. The fourth argument should point to an int  vari-
       able. Zero is returned if there are no back references.

         PCRE_INFO_CAPTURECOUNT

       Return the number of capturing subpatterns in the pattern.
       The fourth argument should point to an int variable.

         PCRE_INFO_FIRSTBYTE

       Return information about the first  byte  of  any  matched
       string,  for  a non-anchored pattern. (This option used to
       be called PCRE_INFO_FIRSTCHAR; the old name is still  rec-
       ognized for backwards compatibility.)

       If  there  is a fixed first byte, e.g. from a pattern such
       as (cat|cow|coyote), it is returned in the integer pointed
       to by where. Otherwise, if either

       (a)  the  pattern  was  compiled  with  the PCRE_MULTILINE
       option, and every branch starts with "^", or

       (b) every branch of  the  pattern  starts  with  ".*"  and
       PCRE_DOTALL  is not set (if it were set, the pattern would
       be anchored),

       -1 is returned, indicating that the pattern  matches  only
       at  the  start  of  a  subject string or after any newline
       within the string. Otherwise -2 is returned. For  anchored
       patterns, -2 is returned.
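
       The three kinds of result can be told apart with a small
       classifier. This is only an illustrative sketch; the enum
       first_byte_kind and the function classify_first_byte are
       hypothetical names, not part of PCRE.

```c
#include <assert.h>

/* Hypothetical classification of the PCRE_INFO_FIRSTBYTE result. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value is the byte every match starts with */
    FIRST_BYTE_LINE_START,  /* -1: matches only at start or after newline */
    FIRST_BYTE_UNKNOWN      /* -2: anchored, or nothing recorded */
};

enum first_byte_kind classify_first_byte(int value)
{
    assert(value >= -2 && value <= 255);
    if (value >= 0)
        return FIRST_BYTE_FIXED;
    if (value == -1)
        return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_UNKNOWN;
}
```

       The argument is the int that pcre_fullinfo() stored through
       its where pointer for PCRE_INFO_FIRSTBYTE.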

         PCRE_INFO_FIRSTTABLE

       If  the pattern was studied, and this resulted in the con-
       struction of a 256-bit table indicating  a  fixed  set  of
       bytes for the first byte in any matching string, a pointer
       to the table is returned. Otherwise NULL is returned.  The
       fourth  argument  should point to an unsigned char * vari-
       able.

         PCRE_INFO_LASTLITERAL

       Return the value of the rightmost literal byte  that  must
       exist  in  any matched string, other than at its start, if
       such a byte has been recorded. The fourth argument  should
       point  to an int variable. If there is no such byte, -1 is
       returned. For anchored patterns, a last  literal  byte  is
       recorded  only if it follows something of variable length.
       For example, for  the  pattern  /^a\d+z\d+/  the  returned
       value  is "z", but for /^a\dz\d/ the returned value is -1.

         PCRE_INFO_NAMECOUNT
         PCRE_INFO_NAMEENTRYSIZE
         PCRE_INFO_NAMETABLE

       PCRE supports the use of named as well as numbered captur-
       ing  parentheses.  The names are just an additional way of
       identifying the parentheses, which still acquire a number.
       A  caller  that wants to extract data from a named subpat-
       tern must convert the name to a number in order to  access
       the  correct pointers in the output vector (described with
       pcre_exec() below). In order to do this, it must first use
       these  three  values  to obtain the name-to-number mapping
       table for the pattern.

       The map  consists  of  a  number  of  fixed-size  entries.
       PCRE_INFO_NAMECOUNT  gives  the  number  of  entries,  and
       PCRE_INFO_NAMEENTRYSIZE gives the size of each entry; both
       of  these  return  an int value. The entry size depends on
       the  length  of  the  longest  name.   PCRE_INFO_NAMETABLE
       returns  a  pointer  to  the  first  entry of the table (a
       pointer to char). The first two bytes of  each  entry  are
       the  number of the capturing parenthesis, most significant
       byte first. The rest of the  entry  is  the  corresponding
       name,  zero  terminated.  The  names  are  in alphabetical
       order. For example, consider the following pattern (assume
       PCRE_EXTENDED  is set, so white space - including newlines
       - is ignored):

         (?P<date> (?P<year>(\d\d)?\d\d) -
         (?P<month>\d\d) - (?P<day>\d\d) )

       There are four named subpatterns, so the  table  has  four
       entries,  and each entry in the table is eight bytes long.
       The table is as follows, with non-printing bytes shown in
       hex, and undefined bytes shown as ??:

         00 01 d  a  t  e  00 ??
         00 05 d  a  y  00 ?? ??
         00 04 m  o  n  t  h  00
         00 02 y  e  a  r  00 ??

       When  writing code to extract data from named subpatterns,
       remember that the length of each entry  may  be  different
       for each compiled pattern.
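
       The table layout described above can be walked as in the
       following sketch, which takes the three values obtained
       from pcre_fullinfo() and returns the number for a given
       name. The function name lookup_group_number is
       hypothetical. A linear scan is used for simplicity, even
       though the alphabetical ordering would permit a binary
       search.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns the capturing parenthesis number for
   name, or -1 if the name is not in the table. */
int lookup_group_number(const char *table, int namecount,
                        int entrysize, const char *name)
{
    int i;
    assert(table != NULL && name != NULL && entrysize > 2);

    for (i = 0; i < namecount; i++) {
        const unsigned char *entry =
            (const unsigned char *)table + (size_t)i * entrysize;

        /* Bytes 2 onward hold the zero-terminated name. */
        if (strcmp((const char *)entry + 2, name) == 0)
            /* Bytes 0 and 1 hold the number, most significant first. */
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

       For the example table above, looking up "month" with
       namecount 4 and entrysize 8 yields 4. The library function
       pcre_get_stringnumber(), listed in the synopsis, takes a
       compiled pattern and a name and returns an int.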

         PCRE_INFO_OPTIONS

       Return  a  copy  of the options with which the pattern was
       compiled. The fourth argument should point to an  unsigned
       long  int  variable. These option bits are those specified
       in the call to pcre_compile(), modified by  any  top-level
       option settings within the pattern itself.

       A  pattern is automatically anchored by PCRE if all of its
       top-level alternatives begin with one of the following:

         ^     unless PCRE_MULTILINE is set
         \A    always
         \G    always
         .*    if PCRE_DOTALL is set and there are no back
                 references to the subpattern in which .* appears

       For  such  patterns,  the  PCRE_ANCHORED bit is set in the
       options returned by pcre_fullinfo().

         PCRE_INFO_SIZE

       Return the size of the  compiled  pattern,  that  is,  the
       value  that  was  passed  as the argument to pcre_malloc()
       when PCRE was getting memory in which to  place  the  com-
       piled  data.  The fourth argument should point to a size_t
       variable.

         PCRE_INFO_STUDYSIZE

       Returns the size of the  data  block  pointed  to  by  the
       study_data field in a pcre_extra block. That is, it is the
       value that was passed to pcre_malloc() when PCRE was  get-
       ting  memory  into  which  to  place  the  data created by
       pcre_study(). The fourth argument should point to a size_t
       variable.


OBSOLETE INFO FUNCTION

       int pcre_info(const pcre *code, int *optptr,
            int *firstcharptr);

       The pcre_info()  function  is  now  obsolete  because  its
       interface  is  too restrictive to return all the available
       data about a compiled pattern.  New  programs  should  use
       pcre_fullinfo()  instead.  The yield of pcre_info() is the
       number of capturing subpatterns, or one of  the  following
       negative numbers:

         PCRE_ERROR_NULL       the argument code was NULL
         PCRE_ERROR_BADMAGIC   the "magic number" was not found

       If  the optptr argument is not NULL, a copy of the options
       with which the pattern was compiled is placed in the inte-
       ger it points to (see PCRE_INFO_OPTIONS above).

       If  the pattern is not anchored and the firstcharptr argu-
       ment is not NULL, it is  used  to  pass  back  information
       about  the  first  character  of  any  matched string (see
       PCRE_INFO_FIRSTBYTE above).


MATCHING A PATTERN

       int pcre_exec(const pcre *code, const pcre_extra *extra,
            const char *subject, int length, int startoffset,
            int options, int *ovector, int ovecsize);

       The function pcre_exec() is  called  to  match  a  subject
       string  against a pre-compiled pattern, which is passed in
       the code argument. If the pattern has  been  studied,  the
       result  of  the  study should be passed in the extra argu-
       ment.

       Here is an example of a simple call to pcre_exec():

         int rc;
         int ovector[30];
         rc = pcre_exec(
           re,             /* result of pcre_compile() */
           NULL,           /* we didn't study the pattern */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           ovector,        /* vector for substring information */
           30);            /* number of elements in the vector */

       If the extra argument is not NULL,  it  must  point  to  a
       pcre_extra  data  block. The pcre_study() function returns
       such a block (when it doesn't return NULL),  but  you  can
       also create one for yourself, and pass additional informa-
       tion in it. The fields in the block are as follows:

         unsigned long int flags;
         void *study_data;
         unsigned long int match_limit;
         void *callout_data;

       The flags field is a bitmap that specifies  which  of  the
       other fields are set. The flag bits are:

         PCRE_EXTRA_STUDY_DATA
         PCRE_EXTRA_MATCH_LIMIT
         PCRE_EXTRA_CALLOUT_DATA

       Other  flag  bits  should  be  set to zero. The study_data
       field is set in the pcre_extra block that is  returned  by
       pcre_study(),  together with the appropriate flag bit. You
       should not set this yourself, but you can add to the block
       by setting the other fields.

       The  match_limit field provides a means of preventing PCRE
       from using up a vast amount of resources when running pat-
       terns  that  are not going to match, but which have a very
       large number of possibilities in their search  trees.  The
       classic  example  is  the use of nested unlimited repeats.
       Internally, PCRE uses a function called match()  which  it
       calls  repeatedly  (sometimes  recursively).  The limit is
       imposed on the number of times  this  function  is  called
       during  a  match,  which  has  the  effect of limiting the
       amount of recursion and backtracking that can take  place.
       For  patterns that are not anchored, the count starts from
       zero for each position in the subject string.

       The default limit for the library can be set when PCRE is
       built; the built-in default is 10 million, which handles
       all but the most extreme cases. You can reduce the default
       by supplying pcre_exec() with a pcre_extra block in which
       match_limit is set to a smaller value, and
       PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the
       limit is exceeded, pcre_exec() returns
       PCRE_ERROR_MATCHLIMIT.

       The callout_data field is used in conjunction with the
       "callout" feature, which is described in the pcrecallout
       documentation.

       The PCRE_ANCHORED option can be passed in the options
       argument, whose unused bits must be zero. This limits
       pcre_exec() to matching at the first matching position.
       However, if a pattern was compiled with PCRE_ANCHORED, or
       turned out to be anchored by virtue of its contents, it
       cannot be made unanchored at matching time.

       When PCRE_UTF8 was set at compile time,  the  validity  of
       the  subject  as  a UTF-8 string is automatically checked,
       and the value of startoffset is  also  checked  to  ensure
       that  it  points  to the start of a UTF-8 character. If an
       invalid UTF-8 sequence  of  bytes  is  found,  pcre_exec()
       returns   the  error  PCRE_ERROR_BADUTF8.  If  startoffset
       contains an invalid  value,  PCRE_ERROR_BADUTF8_OFFSET  is
       returned.

       If  you  already  know that your subject is valid, and you
       want to skip these checks for performance reasons, you can
       set    the    PCRE_NO_UTF8_CHECK   option   when   calling
       pcre_exec(). You might want to do this for the second  and
       subsequent calls to pcre_exec() if you are making repeated
       calls to find all the matches in a single subject  string.
       However,  you should be sure that the value of startoffset
       points  to  the  start  of   a   UTF-8   character.   When
       PCRE_NO_UTF8_CHECK  is  set,  the  effect  of  passing  an
       invalid UTF-8 string as a subject, or a value of startoff-
       set that does not point to the start of a UTF-8 character,
       is undefined. Your program may crash.
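
       Before setting PCRE_NO_UTF8_CHECK, a caller may want its
       own cheap test that a start offset does not fall inside a
       multi-byte character. The following is a minimal sketch of
       such a test, not PCRE's own code. The helper name is
       hypothetical, and treating an offset equal to the subject
       length as acceptable is an assumption. It checks a single
       byte only and does not validate the subject as a whole.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Returns 1 if offset lies
   within the subject and does not land on a UTF-8 continuation byte
   (bit pattern 10xxxxxx); returns 0 otherwise. */
int offset_is_utf8_char_start(const char *subject, int length, int offset)
{
    if (subject == NULL || offset < 0 || offset > length)
        return 0;
    if (offset == length)
        return 1;   /* assumption: end of subject is acceptable */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```

       For the six-byte subject "h\xC3\xA9llo", offset 1 is the
       lead byte of a two-byte character and passes, whereas
       offset 2 is its continuation byte and is rejected.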

       There are also three further options that can be set  only
       at matching time:

         PCRE_NOTBOL

       The  first character of the string is not the beginning of
       a line, so the circumflex metacharacter should  not  match
       before it. Setting this without PCRE_MULTILINE (at compile
       time) causes circumflex never to match.

         PCRE_NOTEOL

       The end of the string is not the end of  a  line,  so  the
       dollar  metacharacter  should  not match it nor (except in
       multiline mode) a newline immediately before  it.  Setting
       this  without PCRE_MULTILINE (at compile time) causes dol-
       lar never to match.

         PCRE_NOTEMPTY

       An empty string is not considered to be a valid  match  if
       this  option is set. If there are alternatives in the pat-
       tern, they are tried. If all the  alternatives  match  the
       empty  string, the entire match fails. For example, if the
       pattern

         a?b?

       is applied to a string not beginning with "a" or  "b",  it
       matches the empty string at the start of the subject. With
       PCRE_NOTEMPTY set,  this  match  is  not  valid,  so  PCRE
       searches further into the string for occurrences of "a" or
       "b".

       Perl has no direct equivalent  of  PCRE_NOTEMPTY,  but  it
       does  make  a special case of a pattern match of the empty
       string within its split() function, and when using the  /g
       modifier. It is possible to emulate Perl's behaviour after
       matching a null string by first trying the match again  at
       the  same  offset with PCRE_NOTEMPTY set, and then if that
       fails by advancing the starting  offset  (see  below)  and
       trying an ordinary match again.

       The  subject  string is passed to pcre_exec() as a pointer
       in subject, a length in length, and a starting byte offset
       in startoffset. Unlike the pattern string, the subject may
       contain binary zero bytes. When  the  starting  offset  is
       zero,  the  search  for a match starts at the beginning of
       the subject, and this is by far the most common case.

       If the pattern was compiled with the PCRE_UTF8 option, the
       subject  must be a sequence of bytes that is a valid UTF-8
       string, and the starting offset must point to  the  begin-
       ning  of  a UTF-8 character. If an invalid UTF-8 string or
       offset is passed, an error (either  PCRE_ERROR_BADUTF8  or
       PCRE_ERROR_BADUTF8_OFFSET)  is returned, unless the option
       PCRE_NO_UTF8_CHECK is set, in which case PCRE's  behaviour
       is not defined.

       A  non-zero  starting  offset is useful when searching for
       another match in the same subject by  calling  pcre_exec()
       again  after a previous success.  Setting startoffset dif-
       fers from just passing over a shortened string and setting
       PCRE_NOTBOL  in the case of a pattern that begins with any
       kind of lookbehind. For example, consider the pattern

         \Biss\B

       which finds occurrences of "iss" in the middle of words.
       (\B matches only if the current position in the subject is
       not a word boundary.) When applied to the string
       "Mississippi" the first call to pcre_exec() finds the first
       occurrence. If pcre_exec() is called again with just the
       remainder of the subject, namely "issippi", it does not
       match, because \B is always false at the start of the
       subject, which is deemed to be a word boundary. However, if
       pcre_exec() is passed the entire string again, but with
       startoffset set to 4, it finds the second occurrence of
       "iss" because it is able to look behind the starting point
       to discover that it is preceded by a letter.

       If  a  non-zero starting offset is passed when the pattern
       is anchored, one attempt to match at the given  offset  is
       tried.  This  can  only  succeed  if  the pattern does not
       require the match to be at the start of the subject.

       In general, a pattern matches a  certain  portion  of  the
       subject, and in addition, further substrings from the sub-
       ject may be picked out by parts of the pattern.  Following
       the  usage  in Jeffrey Friedl's book, this is called "cap-
       turing" in what follows, and the phrase "capturing subpat-
       tern" is used for a fragment of a pattern that picks out a
       substring. PCRE supports several other kinds of  parenthe-
       sized  subpattern  that do not cause substrings to be cap-
       tured.

       Captured substrings are returned to the caller via a  vec-
       tor of integer offsets whose address is passed in ovector.
       The number of elements in the vector is  passed  in  ovec-
       size.  The  first two-thirds of the vector is used to pass
       back captured substrings, each substring using a  pair  of
       integers.  The  remaining  third  of the vector is used as
       workspace by pcre_exec() while matching capturing  subpat-
       terns,  and is not available for passing back information.
       The length passed in ovecsize should always be a  multiple
       of three. If it is not, it is rounded down.

       When  a  match has been successful, information about cap-
       tured substrings is returned in pairs of integers,  start-
       ing at the beginning of ovector, and continuing up to two-
       thirds of its length at the most. The first element  of  a
       pair is set to the offset of the first character in a sub-
       string, and the second is set to the offset of  the  first
       character  after  the  end of a substring. The first pair,
       ovector[0] and ovector[1], identify  the  portion  of  the
       subject  string  matched  by  the entire pattern. The next
       pair is used for the first capturing  subpattern,  and  so
       on.  The  value  returned  by pcre_exec() is the number of
       pairs that have been set. If there are no  capturing  sub-
       patterns,  the  return value from a successful match is 1,
       indicating that just the first pair of  offsets  has  been
       set.

       Some convenience functions are provided for extracting the
       captured  substrings  as  separate  strings.   These   are
       described in the following section.

       It is possible for a capturing subpattern number n+1 to
       match some part of the subject when subpattern n has not
       been used at all. For example, if the string "abc" is
       matched against the pattern (a|(z))(bc), subpatterns 1 and
       3 are matched, but 2 is not. When this happens, both
       offset values corresponding to the unused subpattern are
       set to -1.
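
       The offset pairs can be read directly, as long as unset
       pairs are checked for. The following is a minimal sketch;
       copy_capture() and its return codes are hypothetical and
       are not part of the PCRE API.

```c
#include <string.h>

/* Hypothetical helper, not a PCRE function. Copies capture number n
   into buf using the offset pairs in ovector. rc is the value returned
   by a successful pcre_exec() call. Returns the length copied, -1 if
   the group is unset or out of range, or -2 if buf is too small. */
int copy_capture(const char *subject, const int *ovector, int rc,
                 int n, char *buf, int bufsize)
{
    int start, end, len;

    if (n < 0 || n >= rc)
        return -1;              /* pair n was not set by the match */
    start = ovector[2 * n];
    end = ovector[2 * n + 1];
    if (start < 0 || end < start)
        return -1;              /* unset group: both offsets are -1 */
    len = end - start;
    if (len + 1 > bufsize)
        return -2;              /* no room for text plus the zero */
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

       For the example above, a hand-filled vector starting
       {0,3, 0,1, -1,-1, 1,3} with a return value of 4 yields
       "abc", "a", and "bc" for groups 0, 1, and 3, and reports
       group 2 as unset.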

       If a capturing subpattern is matched repeatedly, it is the
       last portion of the  string  that  it  matched  that  gets
       returned.

       If  the  vector is too small to hold all the captured sub-
       strings, it is used as far as possible (up  to  two-thirds
       of  its length), and the function returns a value of zero.
       In particular, if the substring offsets are not of  inter-
       est, pcre_exec() may be called with ovector passed as NULL
       and ovecsize as zero. However,  if  the  pattern  contains
       back references and the ovector isn't big enough to remem-
       ber the related substrings, PCRE  has  to  get  additional
       memory  for use during matching. Thus it is usually advis-
       able to supply an ovector.

       Note that pcre_info() can be used to  find  out  how  many
       capturing subpatterns there are in a compiled pattern. The
       smallest size for ovector that will allow for  n  captured
       substrings,  in  addition  to the offsets of the substring
       matched by the whole pattern, is (n+1)*3.
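
       The sizing rules can be written down as two small
       functions. This is only a sketch of the arithmetic
       described above; both function names are hypothetical.

```c
/* Hypothetical helpers, not part of PCRE. */

/* Smallest ovecsize that can return n captured substrings in addition
   to the pair for the whole match: (n + 1) * 3. */
int ovecsize_for_captures(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs an ovector of ovecsize ints can return. The
   size is rounded down to a multiple of three and two-thirds of it
   holds pairs, which works out to one pair per three elements. */
int returnable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

       With the 30-element vector used in the earlier example,
       returnable_pairs(30) is 10: the whole match plus up to
       nine captured substrings.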

       If pcre_exec() fails, it returns a  negative  number.  The
       following are defined in the header file:

         PCRE_ERROR_NOMATCH        (-1)

       The subject string did not match the pattern.

         PCRE_ERROR_NULL           (-2)

       Either  code or subject was passed as NULL, or ovector was
       NULL and ovecsize was not zero.

         PCRE_ERROR_BADOPTION      (-3)

       An unrecognized bit was set in the options argument.

         PCRE_ERROR_BADMAGIC       (-4)

       PCRE stores a 4-byte "magic number" at the  start  of  the
       compiled  code, to catch the case when it is passed a junk
       pointer. This is the error it gives when the magic  number
       isn't present.

         PCRE_ERROR_UNKNOWN_NODE   (-5)

       While  running  the  pattern  match,  an  unknown item was
       encountered in the compiled pattern. This error  could  be
       caused  by a bug in PCRE or by overwriting of the compiled
       pattern.

         PCRE_ERROR_NOMEMORY       (-6)

       If a pattern contains back  references,  but  the  ovector
       that  is passed to pcre_exec() is not big enough to remem-
       ber the referenced substrings, PCRE gets a block of memory
       at  the  start of matching to use for this purpose. If the
       call via pcre_malloc() fails, this  error  is  given.  The
       memory is freed at the end of matching.

         PCRE_ERROR_NOSUBSTRING    (-7)

       This   error   is   used   by  the  pcre_copy_substring(),
       pcre_get_substring(), and pcre_get_substring_list()  func-
       tions (see below). It is never returned by pcre_exec().

         PCRE_ERROR_MATCHLIMIT     (-8)

       The recursion and backtracking limit, as specified by the
       match_limit field in a pcre_extra structure (or
       defaulted), was reached. See the description above.

         PCRE_ERROR_CALLOUT        (-9)

       This error is never generated by pcre_exec() itself. It is
       provided for use by callout functions that want to yield a
       distinctive  error code. See the pcrecallout documentation
       for details.

         PCRE_ERROR_BADUTF8        (-10)

       A string that contains an invalid UTF-8 byte sequence  was
       passed as a subject.

         PCRE_ERROR_BADUTF8_OFFSET (-11)

       The  UTF-8  byte sequence that was passed as a subject was
       valid, but the value of startoffset did not point  to  the
       beginning of a UTF-8 character.
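
       For diagnostics it can help to turn a negative return
       value into its symbolic name. The sketch below uses the
       numeric values listed above; the function
       exec_error_name() is hypothetical, and real code should
       compare against the macros from the PCRE header rather
       than rely on the numbers.

```c
/* Hypothetical helper, not part of PCRE. Maps the error numbers listed
   above (-1 to -11) to their symbolic names. */
const char *exec_error_name(int rc)
{
    static const char *const names[] = {
        "PCRE_ERROR_NOMATCH",        /* -1 */
        "PCRE_ERROR_NULL",           /* -2 */
        "PCRE_ERROR_BADOPTION",      /* -3 */
        "PCRE_ERROR_BADMAGIC",       /* -4 */
        "PCRE_ERROR_UNKNOWN_NODE",   /* -5 */
        "PCRE_ERROR_NOMEMORY",       /* -6 */
        "PCRE_ERROR_NOSUBSTRING",    /* -7 */
        "PCRE_ERROR_MATCHLIMIT",     /* -8 */
        "PCRE_ERROR_CALLOUT",        /* -9 */
        "PCRE_ERROR_BADUTF8",        /* -10 */
        "PCRE_ERROR_BADUTF8_OFFSET"  /* -11 */
    };

    if (rc >= 0)
        return "no error";
    if (rc < -11)
        return "unknown error";
    return names[-rc - 1];
}
```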


EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

       int pcre_copy_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber, char *buffer,
            int buffersize);

       int pcre_get_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber,
            const char **stringptr);

       int pcre_get_substring_list(const char *subject,
            int *ovector, int stringcount,
            const char ***listptr);

       Captured  substrings can be accessed directly by using the
       offsets returned by pcre_exec()  in  ovector.  For  conve-
       nience, the functions pcre_copy_substring(), pcre_get_sub-
       string(), and pcre_get_substring_list() are  provided  for
       extracting captured substrings as new, separate, zero-ter-
       minated strings. These functions  identify  substrings  by
       number.  The next section describes functions for extract-
       ing named substrings. A substring that contains  a  binary
       zero  is  correctly extracted and has a further zero added
       on the end, but the result is not, of course, a C  string.

       The first three arguments are the same for all three of
       these functions: subject is the subject string which has
       just been successfully matched, ovector is a pointer to
       the vector of integer offsets that was passed to
       pcre_exec(), and stringcount is the number of substrings
       that were captured by the match, including the substring
       that matched the entire regular expression. This is the
       value returned by pcre_exec() if it is greater than zero.
       If pcre_exec() returned zero, indicating that it ran out
       of space in ovector, the value passed as stringcount
       should be the size of the vector divided by three.
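
       The rule for choosing stringcount can be sketched as
       follows. The function name is hypothetical, and returning
       zero for a failed match is an assumption made here so that
       callers have nothing to extract in that case.

```c
/* Hypothetical helper, not part of PCRE. Derives the stringcount
   argument from the pcre_exec() return value rc and the ovecsize that
   was passed to that call. */
int stringcount_from_exec(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* number of pairs that were set */
    if (rc == 0)
        return ovecsize / 3;    /* ovector was too small: it is full */
    return 0;                   /* assumption: failed match, no strings */
}
```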

       The   functions  pcre_copy_substring()  and  pcre_get_sub-
       string() extract a single substring, whose number is given
       as  stringnumber.  A  value of zero extracts the substring
       that matched  the  entire  pattern,  while  higher  values
       extract   the   captured  substrings.  For  pcre_copy_sub-
       string(), the string is placed in buffer, whose length  is
       given  by buffersize, while for pcre_get_substring() a new
       block of memory  is  obtained  via  pcre_malloc,  and  its
       address  is returned via stringptr. The yield of the func-
       tion is the length of the string, not including the termi-
       nating zero, or one of

         PCRE_ERROR_NOMEMORY       (-6)

       The buffer was too small for pcre_copy_substring(), or the
       attempt to get memory failed for pcre_get_substring().

         PCRE_ERROR_NOSUBSTRING    (-7)

       There is no substring whose number is stringnumber.

       The pcre_get_substring_list() function extracts all avail-
       able substrings and builds a list of pointers to them. All
       this is done in a single block of memory which is obtained
       via  pcre_malloc.  The  address  of  the  memory  block is
       returned via listptr, which is also the start of the  list
       of  string  pointers.  The  end of the list is marked by a
       NULL pointer. The yield of the function  is  zero  if  all
       went well, or

         PCRE_ERROR_NOMEMORY       (-6)

       if the attempt to get the memory block failed.

       When any of these functions encounters a substring that is
       unset, which can happen when capturing subpattern number
       n+1 matches some part of the subject, but subpattern n has
       not been used at all, it returns an empty string. This
       can be distinguished from a genuine zero-length substring
       by inspecting the appropriate offset in ovector, which is
       negative for unset substrings.

       The  two  convenience  functions pcre_free_substring() and
       pcre_free_substring_list() can be used to free the  memory
       returned  by  a  previous  call of pcre_get_substring() or
       pcre_get_substring_list(), respectively. They  do  nothing
       more than call the function pointed to by pcre_free, which
       of course could be called directly from a C program.  How-
       ever,  PCRE  is used in some situations where it is linked
       via a special interface to  another  programming  language
       which cannot use pcre_free directly; it is for these cases
       that the functions are provided.


EXTRACTING CAPTURED SUBSTRINGS BY NAME

       int pcre_copy_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            char *buffer, int buffersize);

       int pcre_get_stringnumber(const pcre *code,
            const char *name);

       int pcre_get_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            const char **stringptr);

       To extract a substring by name, you first have to find the
       associated number. This can be done by calling
       pcre_get_stringnumber(). The first argument is the
       compiled pattern, and the second is the name. For example,
       for this pattern

         ab(?<xxx>\d+)...

       the number of the subpattern called "xxx" is 1. Given  the
       number,  you  can  then extract the substring directly, or
       use one of the functions described in  the  previous  sec-
       tion.  For  convenience, there are also two functions that
       do the whole job.

       Most of the arguments of  pcre_copy_named_substring()  and
       pcre_get_named_substring()  are  the same as those for the
       functions that extract by  number,  and  so  are  not  re-
       described here. There are just two differences.

       First,  instead of a substring number, a substring name is
       given. Second, there is an extra argument,  given  at  the
       start, which is a pointer to the compiled pattern. This is
       needed in order  to  gain  access  to  the  name-to-number
       translation table.

       These  functions  call  pcre_get_stringnumber(), and if it
       succeeds,  they   then   call   pcre_copy_substring()   or
       pcre_get_substring(), as appropriate.

Last updated: 09 December 2003
Copyright (c) 1997-2003 University of Cambridge.



                                                          PCRE(3)
