Unicode Regex; Invalid XML characters
Posted by Ambush Commander on Stack Overflow
Published on 2008-12-29T06:51:44Z
Indexed on 2010/03/27 7:23 UTC
The set of valid XML characters is well known; the XML 1.0 spec defines it as:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
My question is whether it's possible to write a PCRE regular expression for this set (or its inverse) without hard-coding the code points, by using Unicode general categories instead. An inverse might be something like [\p{Cc}\p{Cs}\p{Cn}], except that it improperly covers linefeeds and tabs and misses some other invalid characters.
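For comparison, here is a rough sketch of how the category-based class could be checked against the hard-coded ranges. It is written in Python rather than PCRE, and since Python's built-in re module has no \p{...} support, the class [\p{Cc}\p{Cs}\p{Cn}] is emulated with unicodedata.category. The names INVALID_XML_CHAR, is_valid_xml_char, in_category_class and mismatches are made up for this illustration, and the results depend on the Unicode version bundled with your Python build.

```python
import re
import unicodedata

# Hard-coded inverse of the Char production quoted above (the thing the
# question hopes to avoid); used here only as the reference answer.
INVALID_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_valid_xml_char(ch):
    # True if the single character ch is allowed by the Char production.
    return INVALID_XML_CHAR.match(ch) is None


def in_category_class(ch):
    # Emulates the candidate class: control (Cc), surrogate (Cs),
    # or unassigned/noncharacter (Cn).
    return unicodedata.category(ch) in {"Cc", "Cs", "Cn"}


def mismatches(limit=0x110000):
    # Returns two lists of code points below limit:
    #   over:  flagged by the category class although valid in XML
    #   under: invalid in XML but not flagged by the category class
    over, under = [], []
    for cp in range(limit):
        ch = chr(cp)
        valid = is_valid_xml_char(ch)
        flagged = in_category_class(ch)
        if flagged and valid:
            over.append(cp)
        elif not flagged and not valid:
            under.append(cp)
    return over, under


if __name__ == "__main__":
    over, under = mismatches()
    print("valid but flagged:", len(over), "code points")
    print("invalid but not flagged:", len(under), "code points")
```

Running mismatches() over the full range shows exactly where a category-only expression disagrees with the spec, which is a quick way to judge whether a few explicit exceptions would be enough to repair it.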
© Stack Overflow or respective owner