   1  =head1 NAME
   2  
   3  perlfaq6 - Regular Expressions ($Revision: 10126 $)
   4  
   5  =head1 DESCRIPTION
   6  
   7  This section is surprisingly small because the rest of the FAQ is
   8  littered with answers involving regular expressions.  For example,
   9  decoding a URL and checking whether something is a number are handled
  10  with regular expressions, but those answers are found elsewhere in
  11  this document (in L<perlfaq9>: "How do I decode or create those %-encodings
  12  on the web" and L<perlfaq4>: "How do I determine whether a scalar is
  13  a number/whole/integer/float", to be precise).
  14  
  15  =head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
  16  X<regex, legibility> X<regexp, legibility>
  17  X<regular expression, legibility> X</x>
  18  
  19  Three techniques can make regular expressions maintainable and
  20  understandable.
  21  
  22  =over 4
  23  
  24  =item Comments Outside the Regex
  25  
  26  Describe what you're doing and how you're doing it, using normal Perl
  27  comments.
  28  
  29      # turn the line into the first word, a colon, and the
  30      # number of characters on the rest of the line
  31      s/^(\w+)(.*)/ lc($1) . ":" . length($2) /meg;
  32  
  33  =item Comments Inside the Regex
  34  
  35  The C</x> modifier causes whitespace to be ignored in a regex pattern
  36  (except in a character class), and also allows you to use normal
  37  comments there, too.  As you can imagine, whitespace and comments help
  38  a lot.
  39  
  40  C</x> lets you turn this:
  41  
  42      s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;
  43  
  44  into this:
  45  
  46      s{ <                    # opening angle bracket
  47          (?:                 # Non-backreffing grouping paren
  48              [^>'"] *        # 0 or more things that are neither > nor ' nor "
  49                  |           #    or else
  50              ".*?"           # a section between double quotes (stingy match)
  51                  |           #    or else
  52              '.*?'           # a section between single quotes (stingy match)
  53          ) +                 #   all occurring one or more times
  54          >                   # closing angle bracket
  55      }{}gsx;                 # replace with nothing, i.e. delete
  56  
  57  It's still not quite so clear as prose, but it is very useful for
  58  describing the meaning of each part of the pattern.
  59  
  60  =item Different Delimiters
  61  
  62  While we normally think of patterns as being delimited with C</>
  63  characters, they can be delimited by almost any character.  L<perlre>
  64  describes this.  For example, the C<s///> above uses braces as
  65  delimiters.  Selecting another delimiter can avoid quoting the
  66  delimiter within the pattern:
  67  
  68      s/\/usr\/local/\/usr\/share/g;    # bad delimiter choice
  69      s#/usr/local#/usr/share#g;        # better
  70  
  71  =back
  72  
  73  =head2 I'm having trouble matching over more than one line.  What's wrong?
  74  X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
  75  
  76  Either you don't have more than one line in the string you're looking
  77  at (probably), or else you aren't using the correct modifier(s) on
  78  your pattern (possibly).
  79  
  80  There are many ways to get multiline data into a string.  If you want
  81  it to happen automatically while reading input, you'll want to set $/
  82  (probably to '' for paragraphs or C<undef> for the whole file) to
  83  allow you to read more than one line at a time.
  84  
  85  Read L<perlre> to help you decide which of C</s> and C</m> (or both)
  86  you might want to use: C</s> allows dot to include newline, and C</m>
  87  allows caret and dollar to match next to a newline, not just at the
  88  end of the string.  You do need to make sure that you've actually
  89  got a multiline string in there.
  90  
  91  For example, this program detects duplicate words, even when they span
  92  line breaks (but not paragraph ones).  For this example, we don't need
  93  C</s> because we aren't using dot in a regular expression that we want
  94  to cross line boundaries.  Neither do we need C</m> because we aren't
  95  wanting caret or dollar to match at any point inside the record next
  96  to newlines.  But it's imperative that $/ be set to something other
  97  than the default, or else we won't actually ever have a multiline
  98  record read in.
  99  
 100      $/ = '';          # read in a whole paragraph, not just one line
 101      while ( <> ) {
 102          while ( /\b([\w'-]+)(\s+\1)+\b/gi ) {      # word starts alpha
 103              print "Duplicate $1 at paragraph $.\n";
 104          }
 105      }
 106  
 107  Here's code that finds lines that begin with "From " (which would
 108  be mangled by many mailers):
 109  
 110      $/ = '';          # read in a whole paragraph, not just one line
 111      while ( <> ) {
 112          while ( /^From /gm ) { # /m makes ^ match next to \n
 113              print "leading from in paragraph $.\n";
 114          }
 115      }
 116  
 117  Here's code that finds everything between START and END in a file:
 118  
 119      undef $/;          # read in whole file, not just one line or paragraph
 120      while ( <> ) {
 121          while ( /START(.*?)END/sgm ) { # /s makes . cross line boundaries
 122              print "$1\n";
 123          }
 124      }
 125  
 126  =head2 How can I pull out lines between two patterns that are themselves on different lines?
 127  X<..>
 128  
 129  You can use Perl's somewhat exotic C<..> operator (documented in
 130  L<perlop>):
 131  
 132      perl -ne 'print if /START/ .. /END/' file1 file2 ...
 133  
 134  If you wanted text and not lines, you would use
 135  
 136      perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...
 137  
 138  But if you want nested occurrences of C<START> through C<END>, you'll
 139  run up against the problem described in the question in this section
 140  on matching balanced text.
 141  
 142  Here's another example of using C<..>:
 143  
 144      while (<>) {
 145          $in_header =   1  .. /^$/;
 146          $in_body   = /^$/ .. eof;
 147          # now choose between them
 148      } continue {
 149          $. = 0 if eof;    # fix $.
 150      }
 151  
 152  =head2 I put a regular expression into $/ but it didn't work. What's wrong?
 153  X<$/, regexes in> X<$INPUT_RECORD_SEPARATOR, regexes in>
 154  X<$RS, regexes in>
 155  
 156  $/ has to be a string.  You can use these examples if you really need to 
 157  do this.
 158  
 159  If you have File::Stream, this is easy.
 160  
 161      use File::Stream;
 162  
 163      my $stream = File::Stream->new(
 164          $filehandle,
 165          separator => qr/\s*,\s*/,
 166          );
 167  
 168      print "$_\n" while <$stream>;
 169  
 170  If you don't have File::Stream, you have to do a little more work.
 171  
 172  You can use the four argument form of sysread to continually add to
 173  a buffer.  After you add to the buffer, you check if you have a
 174  complete line (using your regular expression).
 175  
 176      local $_ = "";
 177      while( sysread FH, $_, 8192, length ) {
 178          while( s/^((?s).*?)your_pattern// ) {
 179              my $record = $1;
 180              # do stuff here.
 181          }
 182      }
 183  
 184  You can do the same thing with foreach and a match using the
 185  c flag and the \G anchor, if you do not mind your entire file
 186  being in memory at the end.
 187  
 188      local $_ = "";
 189      while( sysread FH, $_, 8192, length ) {
 190          foreach my $record ( m/\G((?s).*?)your_pattern/gc ) {
 191              # do stuff here.
 192          }
 193          substr( $_, 0, pos ) = "" if pos;
 194      }
 195  
 196  
 197  =head2 How do I substitute case insensitively on the LHS while preserving case on the RHS?
 198  X<replace, case preserving> X<substitute, case preserving>
 199  X<substitution, case preserving> X<s, case preserving>
 200  
 201  Here's a lovely Perlish solution by Larry Rosler.  It exploits
 202  properties of bitwise xor on ASCII strings.
 203  
 204      $_= "this is a TEsT case";
 205  
 206      $old = 'test';
 207      $new = 'success';
 208  
 209      s{(\Q$old\E)}
 210      { uc $new | (uc $1 ^ $1) .
 211          (uc(substr $1, -1) ^ substr $1, -1) x
 212          (length($new) - length $1)
 213      }egi;
 214  
 215      print;
 216  
 217  And here it is as a subroutine, modeled after the above:
 218  
 219      sub preserve_case($$) {
 220          my ($old, $new) = @_;
 221          my $mask = uc $old ^ $old;
 222  
 223          uc $new | $mask .
 224              substr($mask, -1) x (length($new) - length($old))
 225      }
 226  
 227      $a = "this is a TEsT case";
 228      $a =~ s/(test)/preserve_case($1, "success")/egi;
 229      print "$a\n";
 230  
 231  This prints:
 232  
 233      this is a SUcCESS case
 234  
 235  As an alternative, to keep the case of the replacement word if it is
 236  longer than the original, you can use this code, by Jeff Pinyan:
 237  
 238      sub preserve_case {
 239          my ($from, $to) = @_;
 240          my ($lf, $lt) = map length, @_;
 241  
 242          if ($lt < $lf) { $from = substr $from, 0, $lt }
 243          else { $from .= substr $to, $lf }
 244  
 245          return uc $to | ($from ^ uc $from);
 246      }
 247  
 248  This changes the sentence to "this is a SUcCess case."
 249  
 250  Just to show that C programmers can write C in any programming language,
 251  if you prefer a more C-like solution, the following script makes the
 252  substitution have the same case, letter by letter, as the original.
 253  (It also happens to run about 240% slower than the Perlish solution runs.)
 254  If the substitution has more characters than the string being substituted,
 255  the case of the last character is used for the rest of the substitution.
 256  
 257      # Original by Nathan Torkington, massaged by Jeffrey Friedl
 258      #
 259      sub preserve_case($$)
 260      {
 261          my ($old, $new) = @_;
 262          my ($state) = 0; # 0 = no change; 1 = lc; 2 = uc
 263          my ($i, $oldlen, $newlen, $c) = (0, length($old), length($new));
 264          my ($len) = $oldlen < $newlen ? $oldlen : $newlen;
 265  
 266          for ($i = 0; $i < $len; $i++) {
 267              if ($c = substr($old, $i, 1), $c =~ /[\W\d_]/) {
 268                  $state = 0;
 269              } elsif (lc $c eq $c) {
 270                  substr($new, $i, 1) = lc(substr($new, $i, 1));
 271                  $state = 1;
 272              } else {
 273                  substr($new, $i, 1) = uc(substr($new, $i, 1));
 274                  $state = 2;
 275              }
 276          }
 277          # finish up with any remaining new (for when new is longer than old)
 278          if ($newlen > $oldlen) {
 279              if ($state == 1) {
 280                  substr($new, $oldlen) = lc(substr($new, $oldlen));
 281              } elsif ($state == 2) {
 282                  substr($new, $oldlen) = uc(substr($new, $oldlen));
 283              }
 284          }
 285          return $new;
 286      }
 287  
 288  =head2 How can I make C<\w> match national character sets?
 289  X<\w>
 290  
 291  Put C<use locale;> in your script.  The \w character class is taken
 292  from the current locale.
 293  
 294  See L<perllocale> for details.
 295  
 296  =head2 How can I match a locale-smart version of C</[a-zA-Z]/>?
 297  X<alpha>
 298  
 299  You can use the POSIX character class syntax C</[[:alpha:]]/>
 300  documented in L<perlre>.
 301  
 302  No matter which locale you are in, the alphabetic characters are
 303  the characters in \w without the digits and the underscore.
 304  As a regex, that looks like C</[^\W\d_]/>.  Its complement,
 305  the non-alphabetics, is then everything in \W along with
 306  the digits and the underscore, or C</[\W\d_]/>.
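
As a rough illustration of the two forms above, this sketch pulls the
alphabetic characters out of a sample string.  The string and the variable
names are made up for the example, and the result shown assumes plain
ASCII input.

```perl
use strict;
use warnings;

my $sample = "abc_123 XYZ!";    # hypothetical input

# word characters that are neither digits nor the underscore
my @alpha_negated = $sample =~ /[^\W\d_]/g;

# the POSIX class, spelled out directly
my @alpha_posix   = $sample =~ /[[:alpha:]]/g;

print join( '', @alpha_negated ), "\n";   # abcXYZ
print join( '', @alpha_posix ),   "\n";   # abcXYZ
```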
 307  
 308  =head2 How can I quote a variable to use in a regex?
 309  X<regex, escaping> X<regexp, escaping> X<regular expression, escaping>
 310  
 311  The Perl parser will expand $variable and @variable references in
 312  regular expressions unless the delimiter is a single quote.  Remember,
 313  too, that the right-hand side of a C<s///> substitution is considered
 314  a double-quoted string (see L<perlop> for more details).  Remember
 315  also that any regex special characters will be acted on unless you
 316  precede the interpolated variable with C<\Q>.  Here's an example:
 317  
 318      $string = "Placido P. Octopus";
 319      $regex  = "P.";
 320  
 321      $string =~ s/$regex/Polyp/;
 322      # $string is now "Polypacido P. Octopus"
 323  
 324  Because C<.> is special in regular expressions, and can match any
 325  single character, the regex C<P.> here has matched the C<Pl> in the
 326  original string.
 327  
 328  To escape the special meaning of C<.>, we use C<\Q>:
 329  
 330      $string = "Placido P. Octopus";
 331      $regex  = "P.";
 332  
 333      $string =~ s/\Q$regex/Polyp/;
 334      # $string is now "Placido Polyp Octopus"
 335  
 336  The use of C<\Q> causes the C<.> in the regex to be treated as a
 337  regular character, so that C<P.> matches a C<P> followed by a dot.
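
The C<\Q> escape is implemented by the built-in C<quotemeta> function, so
the same quoting can also be done ahead of time.  Here is a small sketch
that reuses the strings from the example; the names C<$quoted> and C<$copy>
are introduced only for illustration.

```perl
use strict;
use warnings;

my $string = "Placido P. Octopus";
my $regex  = "P.";

# escape every regex metacharacter before interpolating
my $quoted = quotemeta $regex;              # P\.

( my $copy = $string ) =~ s/$quoted/Polyp/;
print "$copy\n";                            # Placido Polyp Octopus
```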
 338  
 339  =head2 What is C</o> really for?
 340  X</o, regular expressions> X<compile, regular expressions>
 341  
 342  (contributed by brian d foy)
 343  
 344  The C</o> option for regular expressions (documented in L<perlop> and
 345  L<perlreref>) tells Perl to compile the regular expression only once.
 346  This is only useful when the pattern contains a variable. Perls 5.6
 347  and later handle this automatically if the pattern does not change.
 348  
 349  Since the match operator C<m//>, the substitution operator C<s///>,
 350  and the regular expression quoting operator C<qr//> are double-quotish
 351  constructs, you can interpolate variables into the pattern. See the
 352  answer to "How can I quote a variable to use in a regex?" for more
 353  details.
 354  
 355  This example takes a regular expression from the argument list and
 356  prints the lines of input that match it:
 357  
 358      my $pattern = shift @ARGV;
 359      
 360      while( <> ) {
 361          print if m/$pattern/;
 362          }
 363  
 364  Versions of Perl prior to 5.6 would recompile the regular expression
 365  for each iteration, even if C<$pattern> had not changed. The C</o>
 366  would prevent this by telling Perl to compile the pattern the first
 367  time, then reuse that for subsequent iterations:
 368  
 369      my $pattern = shift @ARGV;
 370      
 371      while( <> ) {
 372          print if m/$pattern/o; # useful for Perl < 5.6
 373          }
 374  
 375  In versions 5.6 and later, Perl won't recompile the regular expression
 376  if the variable hasn't changed, so you probably don't need the C</o>
 377  option. It doesn't hurt, but it doesn't help either. If you want any
 378  version of Perl to compile the regular expression only once even if
 379  the variable changes (thus, only using its initial value), you still
 380  need the C</o>.
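
The "initial value" behaviour is easy to see in a small loop.  In this
sketch (the pattern list and the test string are made up), the match with
C</o> keeps using the first pattern it compiled, while the plain match
follows the variable.

```perl
use strict;
use warnings;

my ( @with_o, @without_o );

foreach my $pattern ( qw(cat dog) ) {
    # compiled once with "cat"; later values of $pattern are ignored
    push @with_o,    ( "hot dog" =~ m/$pattern/o ? "match" : "no match" );

    # recompiled when $pattern changes
    push @without_o, ( "hot dog" =~ m/$pattern/  ? "match" : "no match" );
}

print "with /o:    @with_o\n";       # no match no match
print "without /o: @without_o\n";    # no match match
```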
 381  
 382  You can watch Perl's regular expression engine at work to verify for
 383  yourself if Perl is recompiling a regular expression. The C<use re
 384  'debug'> pragma (comes with Perl 5.005 and later) shows the details.
 385  With Perls before 5.6, you should see C<re> reporting that it's
 386  compiling the regular expression on each iteration. With Perl 5.6 or
 387  later, you should only see C<re> report that for the first iteration.
 388  
 389      use re 'debug';
 390      
 391      $regex = 'Perl';
 392      foreach ( qw(Perl Java Ruby Python) ) {
 393          print STDERR "-" x 73, "\n";
 394          print STDERR "Trying $_...\n";
 395          print STDERR "\t$_ is good!\n" if m/$regex/;
 396          }
 397  
 398  =head2 How do I use a regular expression to strip C style comments from a file?
 399  
 400  While this actually can be done, it's much harder than you'd think.
 401  For example, this one-liner
 402  
 403      perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c
 404  
 405  will work in many but not all cases.  You see, it's too simple-minded for
 406  certain kinds of C programs, in particular, those with what appear to be
 407  comments in quoted strings.  For that, you'd need something like this,
 408  created by Jeffrey Friedl and later modified by Fred Curtis.
 409  
 410      $/ = undef;
 411      $_ = <>;
 412      s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
 413      print;
 414  
 415  This could, of course, be more legibly written with the C</x> modifier, adding
 416  whitespace and comments.  Here it is expanded, courtesy of Fred Curtis.
 417  
 418      s{
 419         /\*         ##  Start of /* ... */ comment
 420         [^*]*\*+    ##  Non-* followed by 1-or-more *'s
 421         (
 422           [^/*][^*]*\*+
 423         )*          ##  0-or-more things which don't start with /
 424                     ##    but do end with '*'
 425         /           ##  End of /* ... */ comment
 426  
 427       |         ##     OR  various things which aren't comments:
 428  
 429         (
 430           "           ##  Start of " ... " string
 431           (
 432             \\.           ##  Escaped char
 433           |               ##    OR
 434             [^"\\]        ##  Non "\
 435           )*
 436           "           ##  End of " ... " string
 437  
 438         |         ##     OR
 439  
 440           '           ##  Start of ' ... ' string
 441           (
 442             \\.           ##  Escaped char
 443           |               ##    OR
 444             [^'\\]        ##  Non '\
 445           )*
 446           '           ##  End of ' ... ' string
 447  
 448         |         ##     OR
 449  
 450           .           ##  Any other char
 451           [^/"'\\]*   ##  Chars which don't start a comment, string or escape
 452         )
 453       }{defined $2 ? $2 : ""}gxse;
 454  
 455  A slight modification also removes C++ comments (as long as they are not
 456  spread over multiple lines using a continuation character):
 457  
 458      s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//[^\n]*|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
 459  
 460  =head2 Can I use Perl regular expressions to match balanced text?
 461  X<regex, matching balanced text> X<regexp, matching balanced text>
 462  X<regular expression, matching balanced text>
 463  
 464  Historically, Perl regular expressions were not capable of matching
 465  balanced text.  As of more recent versions of perl, including 5.6.1,
 466  experimental features have been added that make it possible to do this.
 467  Look at the documentation for the (??{ }) construct in recent perlre manual
 468  pages to see an example of matching balanced parentheses.  Be sure to take
 469  special notice of the warnings present in the manual before making use
 470  of this feature.
 471  
 472  CPAN contains many modules that can be useful for matching text
 473  depending on the context.  Damian Conway provides some useful
 474  patterns in Regexp::Common.  The module Text::Balanced provides a
 475  general solution to this problem.
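
As a minimal sketch of the Text::Balanced route, the following uses its
C<extract_bracketed> function on a made-up string.  It assumes the text
starts with the opening bracket (leading whitespace is allowed); see the
module's documentation for the other extractors and for error handling.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = "(a (nested) group) and the rest";   # hypothetical input

# in list context: the bracketed substring, then whatever follows it
my ( $extracted, $remainder ) = extract_bracketed( $text, '()' );

print "extracted: $extracted\n";    # (a (nested) group)
print "remainder: $remainder\n";    #  and the rest
```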
 476  
 477  One of the common applications of balanced text matching is working
 478  with XML and HTML.  There are many modules available that support
 479  these needs.  Two examples are HTML::Parser and XML::Parser. There
 480  are many others.
 481  
 482  An elaborate subroutine (for 7-bit ASCII only) to pull out balanced
 483  and possibly nested single chars, like C<`> and C<'>, C<{> and C<}>,
 484  or C<(> and C<)> can be found in
 485  http://www.cpan.org/authors/id/TOMC/scripts/pull_quotes.gz .
 486  
 487  The C::Scan module from CPAN also contains such subs for internal use,
 488  but they are undocumented.
 489  
 490  =head2 What does it mean that regexes are greedy?  How can I get around it?
 491  X<greedy> X<greediness>
 492  
 493  Most people mean that greedy regexes match as much as they can.
 494  Technically speaking, it's actually the quantifiers (C<?>, C<*>, C<+>,
 495  C<{}>) that are greedy rather than the whole pattern; Perl prefers local
 496  greed and immediate gratification to overall greed.  To get non-greedy
 497  versions of the same quantifiers, use (C<??>, C<*?>, C<+?>, C<{}?>).
 498  
 499  An example:
 500  
 501      $s1 = $s2 = "I am very very cold";
 502      $s1 =~ s/ve.*y //;      # I am cold
 503      $s2 =~ s/ve.*?y //;     # I am very cold
 504  
 505  Notice how the second substitution stopped matching as soon as it
 506  encountered "y ".  The C<*?> quantifier effectively tells the regular
 507  expression engine to find a match as quickly as possible and pass
 508  control on to whatever is next in line, like you would if you were
 509  playing hot potato.
 510  
 511  =head2 How do I process each word on each line?
 512  X<word>
 513  
 514  Use the split function:
 515  
 516      while (<>) {
 517          foreach $word ( split ) {
 518              # do something with $word here
 519          }
 520      }
 521  
 522  Note that this isn't really a word in the English sense; it's just
 523  chunks of consecutive non-whitespace characters.
 524  
 525  To work with only alphanumeric sequences (including underscores), you
 526  might consider
 527  
 528      while (<>) {
 529          foreach $word (m/(\w+)/g) {
 530              # do something with $word here
 531          }
 532      }
 533  
 534  =head2 How can I print out a word-frequency or line-frequency summary?
 535  
 536  To do this, you have to parse out each word in the input stream.  We'll
 537  pretend that by word you mean chunk of alphabetics, hyphens, or
 538  apostrophes, rather than the non-whitespace chunk idea of a word given
 539  in the previous question:
 540  
 541      while (<>) {
 542          while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
 543              $seen{$1}++;
 544          }
 545      }
 546  
 547      while ( ($word, $count) = each %seen ) {
 548          print "$count $word\n";
 549      }
 550  
 551  If you wanted to do the same thing for lines, you wouldn't need a
 552  regular expression:
 553  
 554      while (<>) {
 555          $seen{$_}++;
 556      }
 557  
 558      while ( ($line, $count) = each %seen ) {
 559          print "$count $line";
 560      }
 561  
 562  If you want these output in a sorted order, see L<perlfaq4>: "How do I
 563  sort a hash (optionally by value instead of key)?".
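
For a quick idea of what that looks like here, this sketch sorts a small,
made-up C<%seen> hash by descending count and breaks ties alphabetically.
The hash contents and the name C<@by_count> are assumptions for the
example.

```perl
use strict;
use warnings;

# stand-in for the %seen hash built by the loops above
my %seen = ( sheep => 3, wolf => 1, dog => 3 );

# highest count first; equal counts fall back to alphabetical order
my @by_count = sort { $seen{$b} <=> $seen{$a} or $a cmp $b } keys %seen;

print "$seen{$_} $_\n" for @by_count;
```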
 564  
 565  =head2 How can I do approximate matching?
 566  X<match, approximate> X<matching, approximate>
 567  
 568  See the module String::Approx available from CPAN.
 569  
 570  =head2 How do I efficiently match many regular expressions at once?
 571  X<regex, efficiency> X<regexp, efficiency>
 572  X<regular expression, efficiency>
 573  
 574  (contributed by brian d foy)
 575  
 576  Avoid asking Perl to compile a regular expression every time
 577  you want to match it.  In this example, perl must recompile
 578  the regular expression for every iteration of the foreach()
 579  loop since it has no way to know what $pattern will be.
 580  
 581      @patterns = qw( foo bar baz );
 582  
 583      LINE: while( <DATA> )
 584          {
 585          foreach $pattern ( @patterns )
 586              {
 587              if( /\b$pattern\b/i )
 588                  {
 589                  print;
 590                  next LINE;
 591                  }
 592              }
 593          }
 594  
 595  The qr// operator showed up in perl 5.005.  It compiles a
 596  regular expression, but doesn't apply it.  When you use the
 597  pre-compiled version of the regex, perl does less work. In
 598  this example, I inserted a map() to turn each pattern into
 599  its pre-compiled form.  The rest of the script is the same,
 600  but faster.
 601  
 602      @patterns = map { qr/\b$_\b/i } qw( foo bar baz );
 603  
 604      LINE: while( <> )
 605          {
 606          foreach $pattern ( @patterns )
 607              {
 608              next unless /$pattern/;    # no match: try the next pattern
 609              print; next LINE;          # print the line once, then move on
 610              }
 611          }
 612  
 613  In some cases, you may be able to make several patterns into
 614  a single regular expression.  Beware of situations that require
 615  backtracking though.
 616  
 617      $regex = join '|', qw( foo bar baz );
 618  
 619      LINE: while( <> )
 620          {
 621          print if /\b(?:$regex)\b/i;
 622          }
 623  
 624  For more details on regular expression efficiency, see Mastering
 625  Regular Expressions by Jeffrey Friedl.  He explains how regular
 626  expression engines work and why some patterns are surprisingly
 627  inefficient.  Once you understand how perl applies regular
 628  expressions, you can tune them for individual situations.
 629  
 630  =head2 Why don't word-boundary searches with C<\b> work for me?
 631  X<\b>
 632  
 633  (contributed by brian d foy)
 634  
 635  Ensure that you know what \b really does: it's the boundary between a
 636  word character, \w, and something that isn't a word character. That
 637  thing that isn't a word character might be \W, but it can also be the
 638  start or end of the string.
 639  
 640  It's not (not!) the boundary between whitespace and non-whitespace,
 641  and it's not the stuff between words we use to create sentences.
 642  
 643  In regex speak, a word boundary (\b) is a "zero width assertion",
 644  meaning that it doesn't represent a character in the string, but a
 645  condition at a certain position.
 646  
 647  For the regular expression, /\bPerl\b/, there has to be a word
 648  boundary before the "P" and after the "l".  As long as something other
 649  than a word character precedes the "P" and follows the "l", the
 650  pattern will match. These strings match /\bPerl\b/.
 651  
 652      "Perl"    # no word char before P or after l
 653      "Perl "   # same as previous (space is not a word char)
 654      "'Perl'"  # the ' char is not a word char
 655      "Perl's"  # no word char before P, non-word char after "l"
 656  
 657  These strings do not match /\bPerl\b/.
 658  
 659      "Perl_"   # _ is a word char!
 660      "Perler"  # no word char before P, but one after l
 661  
 662  You don't have to use \b to match words though.  You can look for
 663  non-word characters surrounded by word characters.  These strings
 664  match the pattern /\b'\b/.
 665  
 666      "don't"   # the ' char is surrounded by "n" and "t"
 667      "qep'a'"  # the ' char is surrounded by "p" and "a"
 668  
 669  This string does not match /\b'\b/.
 670  
 671      "foo'"    # there is no word char after non-word '
 672  
 673  You can also use the complement of \b, \B, to specify that there
 674  should not be a word boundary.
 675  
 676  In the pattern /\Bam\B/, there must be a word character before the "a"
 677  and after the "m". These strings match /\Bam\B/:
 678  
 679      "llama"   # "am" surrounded by word chars
 680      "Samuel"  # same
 681  
 682  These strings do not match /\Bam\B/:
 683  
 684      "Sam"      # no word boundary before "a", but one after "m"
 685      "I am Sam" # "am" surrounded by non-word chars
 686  
 687  
 688  =head2 Why does using $&, $`, or $' slow my program down?
 689  X<$MATCH> X<$&> X<$POSTMATCH> X<$'> X<$PREMATCH> X<$`>
 690  
 691  (contributed by Anno Siegel)
 692  
 693  Once Perl sees that you need one of these variables anywhere in the
 694  program, it provides them on each and every pattern match. That means
 695  that on every pattern match the entire string will be copied, part of it
 696  to $`, part to $&, and part to $'. Thus the penalty is most severe with
 697  long strings and patterns that match often. Avoid $&, $', and $` if you
 698  can, but if you can't, once you've used them at all, use them at will
 699  because you've already paid the price. Remember that some algorithms
 700  really appreciate them. As of the 5.005 release, the $& variable is no
 701  longer "expensive" the way the other two are.
 702  
 703  Since Perl 5.6.1 the special variables @- and @+ can functionally replace
 704  $`, $& and $'.  These arrays contain the offsets of the beginning and end
 705  of each match (see L<perlvar> for the full story), so they give you
 706  essentially the same information, but without the risk of excessive
 707  string copying.
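
A short sketch of that replacement, using C<substr> with the offsets in
C<$-[0]> and C<$+[0]>.  The sample string and the variable names are made
up for the example.

```perl
use strict;
use warnings;

my $text = "Hello, world";      # hypothetical input

my ( $pre, $match, $post );
if ( $text =~ /wor/ ) {
    $pre   = substr( $text, 0, $-[0] );                # same text as $`
    $match = substr( $text, $-[0], $+[0] - $-[0] );    # same text as $&
    $post  = substr( $text, $+[0] );                   # same text as $'
}

print "[$pre][$match][$post]\n";    # [Hello, ][wor][ld]
```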
 708  
 709  =head2 What good is C<\G> in a regular expression?
 710  X<\G>
 711  
 712  You use the C<\G> anchor to start the next match on the same
 713  string where the last match left off.  The regular
 714  expression engine cannot skip over any characters to find
 715  the next match with this anchor, so C<\G> is similar to the
 716  beginning of string anchor, C<^>.  The C<\G> anchor is typically
 717  used with the C<g> flag.  It uses the value of C<pos()>
 718  as the position to start the next match.  As the match
 719  operator makes successive matches, it updates C<pos()> with the
 720  position of the next character past the last match (or the
 721  first character of the next match, depending on how you like
 722  to look at it). Each string has its own C<pos()> value.
 723  
 724  Suppose you want to match all consecutive pairs of digits
 725  in a string like "1122a44" and stop matching when you
 726  encounter non-digits.  You want to match C<11> and C<22> but
 727  the letter C<a> shows up between C<22> and C<44> and you want
 728  to stop at C<a>. Simply matching pairs of digits skips over
 729  the C<a> and still matches C<44>.
 730  
 731      $_ = "1122a44";
 732      my @pairs = m/(\d\d)/g;   # qw( 11 22 44 )
 733  
 734  If you use the C<\G> anchor, you force the match after C<22> to
 735  start with the C<a>.  The regular expression cannot match
 736  there since it does not find a digit, so the next match
 737  fails and the match operator returns the pairs it already
 738  found.
 739  
 740      $_ = "1122a44";
 741      my @pairs = m/\G(\d\d)/g; # qw( 11 22 )
 742  
 743  You can also use the C<\G> anchor in scalar context. You
 744  still need the C<g> flag.
 745  
 746      $_ = "1122a44";
 747      while( m/\G(\d\d)/g )
 748          {
 749          print "Found $1\n";
 750          }
 751  
 752  After the match fails at the letter C<a>, perl resets C<pos()>
 753  and the next match on the same string starts at the beginning.
 754  
 755      $_ = "1122a44";
 756      while( m/\G(\d\d)/g )
 757          {
 758          print "Found $1\n";
 759          }
 760  
 761      print "Found $1 after while" if m/(\d\d)/g; # finds "11"
 762  
 763  You can disable C<pos()> resets on fail with the C<c> flag, documented
 764  in L<perlop> and L<perlreref>. Subsequent matches start where the last
 765  successful match ended (the value of C<pos()>) even if a match on the
 766  same string has failed in the meantime. In this case, the match after
 767  the C<while()> loop starts at the C<a> (where the last match stopped),
 768  and since it does not use any anchor it can skip over the C<a> to find
 769  C<44>.
 770  
 771      $_ = "1122a44";
 772      while( m/\G(\d\d)/gc )
 773          {
 774          print "Found $1\n";
 775          }
 776  
 777      print "Found $1 after while" if m/(\d\d)/g; # finds "44"
 778  
Typically you use the C<\G> anchor with the C<c> flag
when you want to try a different match if one fails,
such as in a tokenizer. Jeffrey Friedl offers this example,
which works in 5.004 or later.
 783  
 784      while (<>) {
 785          chomp;
 786          PARSER: {
 787              m/ \G( \d+\b    )/gcx   && do { print "number: $1\n";  redo; };
 788              m/ \G( \w+      )/gcx   && do { print "word:   $1\n";  redo; };
 789              m/ \G( \s+      )/gcx   && do { print "space:  $1\n";  redo; };
 790              m/ \G( [^\w\d]+ )/gcx   && do { print "other:  $1\n";  redo; };
 791          }
 792      }
 793  
 794  For each line, the C<PARSER> loop first tries to match a series
 795  of digits followed by a word boundary.  This match has to
 796  start at the place the last match left off (or the beginning
 797  of the string on the first match). Since C<m/ \G( \d+\b
 798  )/gcx> uses the C<c> flag, if the string does not match that
 799  regular expression, perl does not reset pos() and the next
 800  match starts at the same position to try a different
 801  pattern.
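
The same technique can be packaged so that it returns tokens instead of printing them. This is only a sketch: C<tokenize> is a hypothetical helper, and it uses C<[^\w\s]+> for the last class (an assumption that differs slightly from the example above) so that punctuation does not swallow trailing whitespace.

```perl
use strict;
use warnings;

# tokenize() is a hypothetical helper: it splits one line into
# [ type, text ] pairs using \G together with the c flag.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    pos($line) = 0;
    while ( pos($line) < length($line) ) {
        # Each failed attempt leaves pos() alone because of the c flag,
        # so the next pattern is tried at the same position.
        if    ( $line =~ m/ \G ( \d+ \b   ) /gcx ) { push @tokens, [ number => $1 ] }
        elsif ( $line =~ m/ \G ( \w+      ) /gcx ) { push @tokens, [ word   => $1 ] }
        elsif ( $line =~ m/ \G ( \s+      ) /gcx ) { push @tokens, [ space  => $1 ] }
        elsif ( $line =~ m/ \G ( [^\w\s]+ ) /gcx ) { push @tokens, [ other  => $1 ] }
        else  { last }    # safety net: never loop without progress
    }
    return @tokens;
}

my @tokens = tokenize("abc 42, x9");
print "$_->[0]: '$_->[1]'\n" for @tokens;
```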
 802  
 803  =head2 Are Perl regexes DFAs or NFAs?  Are they POSIX compliant?
 804  X<DFA> X<NFA> X<POSIX>
 805  
 806  While it's true that Perl's regular expressions resemble the DFAs
 807  (deterministic finite automata) of the egrep(1) program, they are in
 808  fact implemented as NFAs (non-deterministic finite automata) to allow
 809  backtracking and backreferencing.  And they aren't POSIX-style either,
 810  because those guarantee worst-case behavior for all cases.  (It seems
 811  that some people prefer guarantees of consistency, even when what's
 812  guaranteed is slowness.)  See the book "Mastering Regular Expressions"
 813  (from O'Reilly) by Jeffrey Friedl for all the details you could ever
 814  hope to know on these matters (a full citation appears in
 815  L<perlfaq2>).
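
One visible consequence of the backtracking approach is how alternation behaves. The sketch below uses made-up sample text. Perl keeps the first alternative that lets the match succeed, whereas a POSIX-style leftmost-longest matcher would be expected to report the longer text in both cases.

```perl
use strict;
use warnings;

my $text = "foobar";

# Perl tries alternatives left to right and keeps the first one that
# lets the overall match succeed, even if a later one would be longer.
my ($first)   = $text =~ m/(foo|foobar)/;
my ($reorder) = $text =~ m/(foobar|foo)/;

print "foo|foobar matched '$first'\n";
print "foobar|foo matched '$reorder'\n";
```

Here the first pattern captures C<foo> and the second captures C<foobar>, so the order of alternatives matters.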
 816  
 817  =head2 What's wrong with using grep in a void context?
 818  X<grep>
 819  
The problem is that grep builds a return list, regardless of the context.
This means you're making Perl go to the trouble of building a list that
you then just throw away. If the list is large, you waste both time and space.
If your intent is to iterate over the list, use a for loop instead.

In perls older than 5.8.1, map suffers from this problem as well.
Since 5.8.1, map is context aware: in void context, it constructs
no list.
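
As a small illustration with made-up data, the loop form below states the intent to iterate, while grep is kept for the case where you actually want the filtered list back.

```perl
use strict;
use warnings;

my @words = qw( apple banana cherry );
my @seen;

# Discouraged: grep used only for its side effect, result thrown away.
#   grep { push @seen, uc($_) } @words;

# Preferred: a for loop (here as a statement modifier) just iterates.
push @seen, uc($_) for @words;

# grep is the right tool when you keep the list it returns.
my @with_an = grep { /an/ } @words;

print "seen: @seen\n";
print "containing 'an': @with_an\n";
```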
 829  
 830  =head2 How can I match strings with multibyte characters?
 831  X<regex, and multibyte characters> X<regexp, and multibyte characters>
 832  X<regular expression, and multibyte characters> X<martian> X<encoding, Martian>
 833  
Since version 5.6, Perl has had some level of multibyte character
support.  Perl 5.8 or later is recommended.  Supported multibyte
character repertoires include Unicode, and legacy encodings
through the Encode module.  See L<perluniintro>, L<perlunicode>,
and L<Encode>.
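
For example, once you decode bytes into characters with the Encode module, the regex engine works on characters rather than bytes. This is a minimal sketch that assumes the input is UTF-8. The sample bytes are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode );

# UTF-8 bytes for "cafe" with an acute accent on the final e:
# that last letter takes two bytes.
my $bytes = "caf\xC3\xA9";
my $chars = decode( 'UTF-8', $bytes );

my $byte_len = length $bytes;
my $char_len = length $chars;

# After decoding, . matches one character, not one byte.
my ($last) = $chars =~ m/(.)\z/;
my $is_e_acute = ( $last eq "\x{E9}" ) ? 1 : 0;

print "bytes: $byte_len, characters: $char_len\n";
print "last character is U+00E9\n" if $is_e_acute;
```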
 839  
If you are stuck with older Perls, you can do Unicode with the
C<Unicode::String> module, and character conversions using the
C<Unicode::Map8> and C<Unicode::Map> modules.  If you are using
Japanese encodings, you might try jperl 5.005_03.
 844  
 845  Finally, the following set of approaches was offered by Jeffrey
 846  Friedl, whose article in issue #5 of The Perl Journal talks about
 847  this very matter.
 848  
Let's suppose you have some weird Martian encoding where pairs of
ASCII uppercase letters encode single Martian letters (e.g. the two
bytes "CV" make a single Martian letter, as do the two bytes "SG",
"VS", "XX", etc.). Other bytes represent single characters, just like
ASCII.
 854  
 855  So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
 856  nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
 857  
 858  Now, say you want to search for the single character C</GX/>. Perl
 859  doesn't know about Martian, so it'll find the two bytes "GX" in the "I
 860  am CVSGXX!"  string, even though that character isn't there: it just
 861  looks like it is because "SG" is next to "XX", but there's no real
 862  "GX".  This is a big problem.
 863  
 864  Here are a few ways, all painful, to deal with it:
 865  
 866      # Make sure adjacent "martian" bytes are no longer adjacent.
 867      $martian =~ s/([A-Z][A-Z])/ $1 /g;
 868  
 869      print "found GX!\n" if $martian =~ /GX/;
 870  
 871  Or like this:
 872  
    @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
    # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
    #
    foreach $char (@chars) {
        # "print ..., last" would evaluate last before print ever ran
        if ($char eq 'GX') { print "found GX!\n"; last }
    }
 879  
 880  Or like this:
 881  
    while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
        if ($1 eq 'GX') { print "found GX!\n"; last }
    }
 885  
 886  Here's another, slightly less painful, way to do it from Benjamin
 887  Goldberg, who uses a zero-width negative look-behind assertion.
 888  
 889      print "found GX!\n" if    $martian =~ m/
 890          (?<![A-Z])
 891          (?:[A-Z][A-Z])*?
 892          GX
 893          /x;
 894  
This succeeds if the "martian" character GX is in the string, and fails
otherwise.  If you don't like using C<< (?<!) >>, a zero-width negative
look-behind assertion, you can replace C<< (?<![A-Z]) >> with C<(?:^|[^A-Z])>.

Either form has the drawback that the match also covers whatever precedes
the GX (earlier pairs, or the preceding character), so C<$-[0]> and C<$+[0]>
describe that whole span rather than the GX alone.  This usually can be
worked around.
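
One possible way around the offset problem, sketched here on the assumption that an extra capture group is acceptable, is to capture the GX itself and read C<$-[1]> and C<$+[1]> instead. C<find_gx> is a hypothetical helper name.

```perl
use strict;
use warnings;

# find_gx() is a hypothetical helper: it returns the start and end
# offsets of the Martian character GX, or an empty list if absent.
sub find_gx {
    my ($martian) = @_;
    return unless $martian =~ m/
        (?<![A-Z])          # not in the middle of a run of capitals
        (?:[A-Z][A-Z])*?    # skip whole Martian characters
        (GX)                # capture the one we want
        /x;
    return ( $-[1], $+[1] );    # offsets of group 1 only
}

my @hit  = find_gx("I am CVGXSG!");   # GX is a real character here
my @miss = find_gx("I am CVSGXX!");   # only looks like it contains GX

print @hit  ? "found GX at $hit[0]..$hit[1]\n" : "no GX\n";
print @miss ? "found GX at $miss[0]..$miss[1]\n" : "no GX\n";
```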
 901  
 902  =head2 How do I match a regular expression that's in a variable?
X<regex, in variable> X<eval> X<regex> X<quotemeta> X<\Q, regex>
X<\E, regex> X<qr//>
 905  
 906  (contributed by brian d foy)
 907  
 908  We don't have to hard-code patterns into the match operator (or
 909  anything else that works with regular expressions). We can put the
 910  pattern in a variable for later use.
 911  
 912  The match operator is a double quote context, so you can interpolate
 913  your variable just like a double quoted string. In this case, you
 914  read the regular expression as user input and store it in C<$regex>.
 915  Once you have the pattern in C<$regex>, you use that variable in the
 916  match operator.
 917  
 918      chomp( my $regex = <STDIN> );
 919  
 920      if( $string =~ m/$regex/ ) { ... }
 921  
 922  Any regular expression special characters in C<$regex> are still
 923  special, and the pattern still has to be valid or Perl will complain.
 924  For instance, in this pattern there is an unpaired parenthesis.
 925  
 926      my $regex = "Unmatched ( paren";
 927  
 928      "Two parens to bind them all" =~ m/$regex/;
 929  
When Perl compiles the regular expression, it treats the parenthesis
as the start of a capture group. When it doesn't find the closing
parenthesis, it complains:
 933  
 934      Unmatched ( in regex; marked by <-- HERE in m/Unmatched ( <-- HERE  paren/ at script line 3.
 935  
You can get around this in several ways depending on your situation.
First, if you don't want any of the characters in the string to be
special, you can escape them with C<quotemeta> before you use the string.
 939  
 940      chomp( my $regex = <STDIN> );
 941      $regex = quotemeta( $regex );
 942  
 943      if( $string =~ m/$regex/ ) { ... }
 944  
 945  You can also do this directly in the match operator using the C<\Q>
 946  and C<\E> sequences. The C<\Q> tells Perl where to start escaping
 947  special characters, and the C<\E> tells it where to stop (see L<perlop>
 948  for more details).
 949  
 950      chomp( my $regex = <STDIN> );
 951  
 952      if( $string =~ m/\Q$regex\E/ ) { ... }
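
To make the effect of the escaping concrete, here is a small sketch with an invented input string. It shows what C<quotemeta> produces and how the escaped and unescaped forms behave against the same text.

```perl
use strict;
use warnings;

my $input = "1+1=2 (really)";              # made-up user input
my $text  = "is 1+1=2 (really) true?";

# quotemeta backslashes every non-word character.
my $escaped = quotemeta $input;

# Escaped: the input is matched as literal text.
my $literal_hit = ( $text =~ m/\Q$input\E/ ) ? 1 : 0;

# Unescaped: + and ( ) keep their special meaning, so the pattern
# no longer describes the literal text.
my $unescaped_hit = ( $text =~ m/$input/ ) ? 1 : 0;

print "escaped form: $escaped\n";
print "literal match: $literal_hit, unescaped match: $unescaped_hit\n";
```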
 953  
Alternatively, you can use C<qr//>, the regular expression quote operator (see
L<perlop> for more details).  It quotes and perhaps compiles the pattern,
and you can apply regular expression flags to the pattern.
 957  
    chomp( my $input = <STDIN> );

    my $regex = qr/$input/is;

    $string =~ m/$regex/;  # same as m/$input/is
 963  
 964  You might also want to trap any errors by wrapping an C<eval> block
 965  around the whole thing.
 966  
    chomp( my $input = <STDIN> );

    eval {
        # no \Q...\E here: an escaped pattern could never fail to compile
        if( $string =~ m/$input/ ) { ... }
        };
    warn $@ if $@;
 973  
 974  Or...
 975  
 976      my $regex = eval { qr/$input/is };
 977      if( defined $regex ) {
 978          $string =~ m/$regex/;
 979          }
 980      else {
 981          warn $@;
 982          }
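
If you compile user-supplied patterns in more than one place, you might wrap the C<eval> in a small routine. This is a sketch only; C<compile_pattern> is a hypothetical helper, and the sample patterns are invented.

```perl
use strict;
use warnings;

# compile_pattern() is a hypothetical helper: it returns the compiled
# regex and undef on success, or undef and the error text on failure.
sub compile_pattern {
    my ($source) = @_;
    my $regex = eval { qr/$source/ };
    return defined($regex) ? ( $regex, undef ) : ( undef, $@ );
}

my ( $ok,  $ok_err )  = compile_pattern('^\d+$');
my ( $bad, $bad_err ) = compile_pattern('Unmatched ( paren');

print "valid pattern compiled\n" if defined $ok;
print "invalid pattern rejected: $bad_err" if !defined $bad;
```

Returning the error text alongside the regex lets the caller decide whether to warn, die, or ask the user for another pattern.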
 983  
 984  =head1 REVISION
 985  
 986  Revision: $Revision: 10126 $
 987  
 988  Date: $Date: 2007-10-27 21:29:20 +0200 (Sat, 27 Oct 2007) $
 989  
 990  See L<perlfaq> for source control details and availability.
 991  
 992  =head1 AUTHOR AND COPYRIGHT
 993  
 994  Copyright (c) 1997-2007 Tom Christiansen, Nathan Torkington, and
 995  other authors as noted. All rights reserved.
 996  
 997  This documentation is free; you can redistribute it and/or modify it
 998  under the same terms as Perl itself.
 999  
1000  Irrespective of its distribution, all code examples in this file
1001  are hereby placed into the public domain.  You are permitted and
1002  encouraged to use this code in your own programs for fun
1003  or for profit as you see fit.  A simple comment in the code giving
1004  credit would be courteous but is not required.

