How can I check if a binary string is UTF-8 in mysql?
Posted
by Piotr Czapla
on Stack Overflow
See other posts from Stack Overflow
or by Piotr Czapla
Published on 2010-02-04T13:05:46Z
Indexed on
2010/03/14
18:25 UTC
Read the original article
Hit count: 267
I've found a Perl regexp that can check if a string is UTF-8 (the regexp is from w3c site).
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
But I'm not sure how to port it to MySQL as it seems that MySQL don't support hex representation of characters see this question.
Any thoughts how to port the regexp to MySQL? Or maybe you know any other way to check if the string is valid UTF-8?
UPDATE: I need this check working on the MySQL as I need to run it on the server to correct broken tables. I can't pass the data through a script as the database is around 1TB.
© Stack Overflow or respective owner