New regular expression features in PCRE 8.34 and 8.35

Posted by Jan Goyvaerts on Regular-Expressions.info See other posts from Regular-Expressions.info or by Jan Goyvaerts
Published on Wed, 14 May 2014 06:26:59 +0000 Indexed on 2014/05/26 21:28 UTC
Read the original article Hit count: 600

Filed under:

PCRE 8.34 adds some new regex features and changes the behavior of a few to make it better compatible with the latest versions of Perl. There are no changes to the regex syntax in PCRE 8.35.

\o{377} is now an octal escape just like \377. This syntax was first introduced in Perl 5.12. It avoids any confusion between octal escapes and backreferences. It also allows octal numbers beyond 377 to be used. E.g. \o{400} is the same as \x{100}. If you have any reason to use octal escapes instead of hexadecimal escapes then you should definitely use the new syntax. Because of this change, \o is now an error when it doesn’t form a valid octal escape. Previously \o was a literal o and \o{377} was a sequence of 337 o‘s.

In free-spacing mode, whitespace between a quantifier and the ? that makes it lazy or the + that makes it possessive is now ignored. In Perl this has always been the case. In PCRE 8.33 and prior, whitespace ended a quantifier and any following ? or + was seen as a second quantifier and thus an error.

The shorthand \s now matches the vertical tab character in addition to the other whitespace characters it previously matched. Perl 5.18 made the same change. Many other regex flavors have always included the vertical tab in \s, just like POSIX has always included it in [[:space:]].

Names of capturing groups are no longer allowed to start with a digit. This has always been the case in Perl since named groups were added to Perl 5.10. PCRE 8.33 and prior even allowed group names to consist entirely of digits.

[[:<:]] and [[:>:]] are now treated as POSIX-style word boundaries. They match at the start and the end of a word. Though they use similar syntax, these have nothing to do with POSIX character classes and cannot be used inside character classes. Perl does not support POSIX word boundaries.

The same changes affect PHP 5.5.10 (and later) and R 3.0.3 (and later) as they have been updated to use PCRE 8.34.

RegexBuddy and RegexMagic have been updated to support the latest versions of PCRE, PHP, and R. Older versions that were previously supported are still supported, so you can compare or convert your regular expressions between the latest versions of PCRE, PHP, and R and whichever version you were using previously.

© Regular-Expressions.info or respective owner

Related posts about Regex Libraries