New Regular Expression Features in Java 8
- by Jan Goyvaerts
Java 8 brings a few changes to Java’s regular expression syntax to make it more consistent with Perl 5.14 and later in matching horizontal and vertical whitespace.
\h is a new feature. It is a shorthand character class that matches any horizontal whitespace character as defined in the Unicode standard.
In Java 4 to 7 \v is a character escape that matches only the vertical tab character. In Java 8 \v is a shorthand character class that matches any vertical whitespace, including the vertical tab. When upgrading to Java 8, make sure that any regexes that use \v still do what you want. Use \x0B or \cK to match just the vertical tab in any version of Java.
\R is also a new feature. It matches any line break as defined by the Unicode standard. Windows-style CRLF pairs are always matched as a whole. So \R matches \r\n while \R\R fails to match \r\n. \R is equivalent to (?\r\n|[\n\cK\f\r\u0085\u2028\u2029]) with an atomic group that prevents it from matching only the CR in a CRLF pair. Oracle’s documentation for the Pattern class omits the atomic group when explaining \R, which is incorrect. You cannot use \R inside a character class.
RegexBuddy and RegexMagic have been updated to support Java 8. Java 4, 5, 6, and 7 are still supported. When you upgrade to Java 8 you can compare or convert your regular expressions between Java 8 and the Java version you were using previously.