Index of Section 1 Manual Pages

Interix / SUApcretest.1Interix / SUA

PCRETEST(1)                                           PCRETEST(1)



NAME
       pcretest  -  a program for testing Perl-compatible regular
       expressions.

SYNOPSIS
       pcretest [-d] [-i] [-m]  [-o  osize]  [-p]  [-t]  [source]
       [destination]

       pcretest  was written as a test program for the PCRE regu-
       lar expression library itself, but it can also be used for
       experimenting  with  regular  expressions.  This  document
       describes the features of the test program; for details of
       the  regular  expressions  themselves, see the pcrepattern
       documentation. For details of PCRE and  its  options,  see
       the pcreapi documentation.


OPTIONS


       -C        Output  the  version number of the PCRE library,
                 and all available information about the optional
                 features that are included, and then exit.

       -d        Behave as if each regex had the /D modifier (see
                 below); the internal form is output after compi-
                 lation.

       -i        Behave  as  if  each  regex had the /I modifier;
                 information about the compiled pattern is  given
                 after compilation.

       -m        Output  the  size of each compiled pattern after
                 it has been  compiled.  This  is  equivalent  to
                 adding  /M  to each regular expression. For com-
                 patibility with earlier versions of pcretest, -s
                 is a synonym for -m.

       -o osize  Set  the number of elements in the output vector
                 that is used when calling PCRE to be osize.  The
                 default value is 45, which is enough for 14 cap-
                 turing subexpressions. The vector  size  can  be
                 changed for individual matching calls by includ-
                 ing \O in the data line (see below).

       -p        Behave as if each regex  has  /P  modifier;  the
                 POSIX  wrapper API is used to call PCRE. None of
                 the other options has any effect when -p is set.

       -t        Run  each  compile,  study, and match many times
                 with a timer, and output resulting time per com-
                 pile  or  match (in milliseconds). Do not set -t
                 with -m, because you will then get the size out-
                 put  20000  times  and  the  timing will be dis-
                 torted.


DESCRIPTION

       If pcretest is given two filename arguments, it reads from
       the  first  and  writes to the second. If it is given only
       one filename argument, it reads from that file and  writes
       to  stdout.  Otherwise,  it reads from stdin and writes to
       stdout, and prompts for each line of input, using "re>" to
       prompt  for regular expressions, and "data>" to prompt for
       data lines.

       The program handles any number of sets of input on a  sin-
       gle input file. Each set starts with a regular expression,
       and continues with any number of data lines to be  matched
       against the pattern.

       Each  line is matched separately and independently. If you
       want to do multiple-line matches, you have to use  the  \n
       escape  sequence  in  a single line of input to encode the
       newline characters. The maximum length  of  data  line  is
       30,000 characters.

       An  empty line signals the end of the data lines, at which
       point a  new  regular  expression  is  read.  The  regular
       expressions  are  given  enclosed  in  any  non-alphameric
       delimiters other than backslash, for example

         /(a|bc)x+yz/

       White space before the initial  delimiter  is  ignored.  A
       regular  expression  may  be  continued over several input
       lines, in which case the newline characters  are  included
       within  it. It is possible to include the delimiter within
       the pattern by escaping it, for example

         /abc\/def/

       If you do so, the escape and the delimiter  form  part  of
       the   pattern,   but  since  delimiters  are  always  non-
       alphameric, this does not affect its  interpretation.   If
       the  terminating  delimiter  is  immediately followed by a
       backslash, for example,

         /abc/\

       then a backslash is added to the end of the pattern.  This
       is  done  to  provide a way of testing the error condition
       that arises  if  a  pattern  finishes  with  a  backslash,
       because

         /abc\/

       is  interpreted as the first line of a pattern that starts
       with "abc/", causing pcretest to read the next line  as  a
       continuation of the regular expression.


PATTERN MODIFIERS

       The  pattern  may  be followed by i, m, s, or x to set the
       PCRE_CASELESS,     PCRE_MULTILINE,     PCRE_DOTALL,     or
       PCRE_EXTENDED options, respectively. For example:

         /caseless/i

       These  modifier letters have the same effect as they do in
       Perl. There are others that set PCRE options that  do  not
       correspond to anything in Perl: /A, /E, /N, /U, and /X set
       PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY,  PCRE_NO_AUTO_CAPTURE,
       PCRE_UNGREEDY, and PCRE_EXTRA respectively.

       Searching  for  all  possible  matches within each subject
       string can be requested by the /g or  /G  modifier.  After
       finding  a  match,  PCRE  is  called  again  to search the
       remainder of the subject string. The difference between /g
       and /G is that the former uses the startoffset argument to
       pcre_exec() to start searching at a new point  within  the
       entire string (which is in effect what Perl does), whereas
       the latter passes over a shortened substring. This makes a
       difference  to  the matching process if the pattern begins
       with a lookbehind assertion (including \b or \B).

       If any call to pcre_exec() in a /g or /G sequence  matches
       an   empty   string,  the  next  call  is  done  with  the
       PCRE_NOTEMPTY and PCRE_ANCHORED  flags  set  in  order  to
       search  for  another,  non-empty, match at the same point.
       If this second match fails, the start offset  is  advanced
       by one, and the normal match is retried. This imitates the
       way Perl handles such cases when using the /g modifier  or
       the split() function.

       There  are a number of other modifiers for controlling the
       way pcretest operates.

       The /+ modifier requests that as well  as  outputting  the
       substring that matched the entire pattern, pcretest should
       in addition output the remainder of  the  subject  string.
       This is useful for tests where the subject contains multi-
       ple copies of the same substring.

       The /L modifier must be followed directly by the name of a
       locale, for example,

         /pattern/Lfr

       For  this reason, it must be the last modifier letter. The
       given locale is set, pcre_maketables() is called to  build
       a set of character tables for the locale, and this is then
       passed  to  pcre_compile()  when  compiling  the   regular
       expression.  Without an /L modifier, NULL is passed as the
       tables pointer; that is, /L applies only to the expression
       on which it appears.

       The  /I modifier requests that pcretest output information
       about the compiled expression (whether it is anchored, has
       a fixed first character, and so on). It does this by call-
       ing pcre_fullinfo() after  compiling  an  expression,  and
       outputting the information it gets back. If the pattern is
       studied, the results of that are also output.

       The /D modifier is a PCRE debugging  feature,  which  also
       assumes /I.  It causes the internal form of compiled regu-
       lar expressions to be output  after  compilation.  If  the
       pattern was studied, the information returned is also out-
       put.

       The /S modifier causes pcre_study() to be called after the
       expression  has  been  compiled, and the results used when
       the expression is matched.

       The /M modifier causes the size of memory  block  used  to
       hold the compiled pattern to be output.

       The /P modifier causes pcretest to call PCRE via the POSIX
       wrapper API rather than its native API. When this is done,
       all  other  modifiers  except  /i, /m, and /+ are ignored.
       REG_ICASE is set if /i is present, and REG_NEWLINE is  set
       if   /m   is   present.   The   wrapper   functions  force
       PCRE_DOLLAR_ENDONLY   always,   and   PCRE_DOTALL   unless
       REG_NEWLINE is set.

       The  /8  modifier  causes  pcretest  to call PCRE with the
       PCRE_UTF8 option set. This  turns  on  support  for  UTF-8
       character  handling in PCRE, provided that it was compiled
       with this support enabled. This modifier also  causes  any
       non-printing  characters  in  output strings to be printed
       using the \x{hh...}  notation  if  they  are  valid  UTF-8
       sequences.

       If  the /? modifier is used with /8, it causes pcretest to
       call pcre_compile() with the PCRE_NO_UTF8_CHECK option, to
       suppress the checking of the string for UTF-8 validity.


CALLOUTS

       If  the  pattern contains any callout requests, pcretest's
       callout function will be called. By default,  it  displays
       the callout number, and the start and current positions in
       the text at the callout time. For example, the output

         --->pqrabcdef
           0    ^  ^

       indicates that callout  number  0  occurred  for  a  match
       attempt  starting  at  the fourth character of the subject
       string, when the pointer was at the seventh character. The
       callout  function  returns  zero  (carry  on  matching) by
       default.

       Inserting callouts may be helpful when using  pcretest  to
       check  complicated regular expressions. For further infor-
       mation about callouts, see the pcrecallout  documentation.

       For  testing the PCRE library, additional control of call-
       out behaviour is available via  escape  sequences  in  the
       data,  as  described in the following section. In particu-
       lar, it is possible to pass in a number  as  callout  data
       (the  default is zero). If the callout function receives a
       non-zero number, it returns that value instead of zero.


DATA LINES

       Before each data line is passed  to  pcre_exec(),  leading
       and trailing whitespace is removed, and it is then scanned
       for \ escapes. Some of these are pretty esoteric features,
       intended  for  checking  out  some of the more complicated
       features of PCRE. If you are just testing "ordinary" regu-
       lar expressions, you probably don't need any of these. The
       following escapes are recognized:

         \a         alarm (= BEL)
         \b         backspace
         \e         escape
         \f         formfeed
         \n         newline
         \r         carriage return
         \t         tab
         \v         vertical tab
         \nnn       octal character (up to 3 octal digits)
         \xhh       hexadecimal character (up to 2 hex digits)
         \x{hh...}  hexadecimal character, any number of digits
                      in UTF-8 mode
         \A         pass the PCRE_ANCHORED option to pcre_exec()
         \B         pass the PCRE_NOTBOL option to pcre_exec()
         \Cdd       call pcre_copy_substring() for substring dd
                      after a successful match (any decimal  num-
       ber
                      less than 32)
         \Cname      call  pcre_copy_named_substring()  for  sub-
       string
                      "name" after a successful match (name  ter-
       min-
                      ated by next non alphanumeric character)
         \C+        show the current captured substrings at call-
       out
                      time
         \C-        do not supply a callout function
         \C!n       return 1 instead of 0 when callout  number  n
       is
                      reached
         \C!n!m      return  1 instead of 0 when callout number n
       is
                      reached for the nth time
         \C*n       pass the number n (may be negative) as  call-
       out
                      data
         \Gdd       call pcre_get_substring() for substring dd
                      after  a successful match (any decimal num-
       ber
                      less than 32)
         \Gname     call pcre_get_named_substring() for substring
                      "name"  after a successful match (name ter-
       min-
                      ated by next non-alphanumeric character)
         \L         call pcre_get_substringlist() after a
                      successful match
         \M         discover the minimum MATCH_LIMIT setting
         \N         pass the PCRE_NOTEMPTY option to pcre_exec()
         \Odd       set the size of the output vector passed to
                      pcre_exec() to dd (any number of decimal
                      digits)
         \S         output details of memory get/free calls  dur-
       ing matching
         \Z         pass the PCRE_NOTEOL option to pcre_exec()
         \?         pass the PCRE_NO_UTF8_CHECK option to
                      pcre_exec()

       If  \M  is  present,  pcretest  calls  pcre_exec() several
       times, with different values in the match_limit  field  of
       the  pcre_extra data structure, until it finds the minimum
       number that is needed for pcre_exec()  to  complete.  This
       number  is  a measure of the amount of recursion and back-
       tracking that takes place, and  checking  it  out  can  be
       instructive.  For most simple matches, the number is quite
       small, but for patterns with very large numbers of  match-
       ing  possibilities,  it can become large very quickly with
       increasing length of subject string.

       When \O is used, it may be higher or lower than  the  size
       set by the -O option (or defaulted to 45); \O applies only
       to the call of  pcre_exec()  for  the  line  in  which  it
       appears.

       A  backslash  followed  by  anything else just escapes the
       anything else. If the very last character is a  backslash,
       it  is  ignored. This gives a way of passing an empty line
       as data, since a  real  empty  line  terminates  the  data
       input.

       If  /P was present on the regex, causing the POSIX wrapper
       API to be used, only 0 causing REG_NOTBOL  and  REG_NOTEOL
       to be passed to regexec() respectively.

       The  use of \x{hh...} to represent UTF-8 characters is not
       dependent on the use of the /8 modifier on the pattern. It
       is recognized always. There may be any number of hexadeci-
       mal digits inside the braces. The result is  from  one  to
       six bytes, encoded according to the UTF-8 rules.


OUTPUT FROM PCRETEST

       When  a  match succeeds, pcretest outputs the list of cap-
       tured substrings that pcre_exec() returns,  starting  with
       number  0  for  the string that matched the whole pattern.
       Here is an example of an interactive pcretest run.

         $ pcretest
         PCRE version 4.00 08-Jan-2003

           re> /^abc(\d+)/
         data> abc123
          0: abc123
          1: 123
         data> xyz
         No match

       If the strings contain any non-printing  characters,  they
       are output as \0x escapes, or as \x{...} escapes if the /8
       modifier was present on the pattern. If  the  pattern  has
       the  /+  modifier, then the output for substring 0 is fol-
       lowed by the the rest of the subject string, identified by
       "0+" like this:

           re> /cat/+
         data> cataract
          0: cat
          0+ aract

       If  the  pattern has the /g or /G modifier, the results of
       successive matching attempts are output in sequence,  like
       this:

           re> /\Bi(\w\w)/g
         data> Mississippi
          0: iss
          1: ss
          0: iss
          1: ss
          0: ipp
          1: pp

       "No  match"  is  output  only  if  the first match attempt
       fails.

       If any of the sequences \C, \G, or \L  are  present  in  a
       data  line  that  is  successfully matched, the substrings
       extracted by the convenience functions are output with  C,
       G,  or  L after the string number instead of a colon. This
       is in addition to the normal full list. The string  length
       (that  is,  the  return  from  the extraction function) is
       given in parentheses after each string for \C and \G.

       Note that while patterns can  be  continued  over  several
       lines (a plain ">" prompt is used for continuations), data
       lines may not. However newlines can be included in data by
       means of the \n escape.


AUTHOR

       Philip Hazel 
       University Computing Service,
       Cambridge CB2 3QG, England.

Last updated: 09 December 2003
Copyright (c) 1997-2003 University of Cambridge.



                                                      PCRETEST(1)

Interix / SUAHosted at SUA Community for Interix, SUA and SFUInterix / SUA