regex to filter all but whitelisted characters from a multi-language string
- by jeroen
I am trying to clean up a string coming from a search box on a multi-language site.
Normally I would use a regex like:
$allowed = "-+?!,.;:\w\s";
$txt_search = preg_replace("/[^" . $allowed . "]?(.*?)[^" . $allowed . "]?/iu", "$1", $_GET['txt_search']);
and that works fine for English texts.
However, now I need to do the same when the text entered can be in any language (Russian now, Chinese in the future).
How can I clean up the string while preserving "normal texts" in the original language?
I thought about switching to a blacklist (although I'd rather not...), but at the moment the regex just completely destroys all of the original input.
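To make the goal concrete, here is a minimal sketch of one possible whitelist approach. It assumes the input is valid UTF-8 and that the PCRE build supports Unicode property classes. The function name `clean_search` is hypothetical, and the script-independent classes `\p{L}`, `\p{M}` and `\p{N}` stand in for `\w`. This is not a confirmed fix for the regex above.

```php
<?php
// Hypothetical helper (not part of the original code): removes every
// character that is not a letter, combining mark, number, whitespace,
// underscore, or one of the listed punctuation characters.
function clean_search(string $input): string
{
    // \p{L} = letters in any script, \p{M} = combining marks,
    // \p{N} = numbers. The trailing "-" is a literal hyphen.
    $allowed = '\p{L}\p{M}\p{N}\s_+?!,.;:-';

    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $cleaned = preg_replace('/[^' . $allowed . ']+/u', '', $input);

    // With /u, preg_replace() returns null if the subject is not valid
    // UTF-8; fall back to an empty string in that case.
    return $cleaned ?? '';
}
```

It could be called as `$txt_search = clean_search($_GET['txt_search'] ?? '');`. If the form submits text in another encoding, the input would presumably have to be converted to UTF-8 first (for example with `mb_convert_encoding()`), since invalid UTF-8 makes the `/u` match fail outright.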