UTF-8 bit representation

I'm learning about the UTF-8 standard, and this is what I've learned so far:

Definition and bytes used
UTF-8 binary representation         Meaning
0xxxxxxx                            1 byte for 1 to 7 bit chars
110xxxxx 10xxxxxx                   2 bytes for 8 to 11 bit chars
1110xxxx 10xxxxxx 10xxxxxx          3 bytes for 12 to 16 bit chars
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 4 bytes for 17 to 21 bit chars
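
To make sure I understand the table, here is a minimal sketch in Python of how a code point gets packed into those byte patterns (just an illustration of the table above, not the standard library's implementation; the helper name encode_utf8 is my own):

def encode_utf8(code_point: int) -> bytes:
    """Illustrative encoder following the UTF-8 byte patterns above."""
    if code_point < 0x80:                      # fits in 7 bits  -> 0xxxxxxx
        return bytes([code_point])
    elif code_point < 0x800:                   # fits in 11 bits -> 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    elif code_point < 0x10000:                 # fits in 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    elif code_point < 0x110000:                # fits in 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")

# e.g. U+00E9 (é) -> b'\xc3\xa9', matching the 110xxxxx 10xxxxxx pattern
print(encode_utf8(0xE9))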

And I'm wondering: why doesn't a 2-byte UTF-8 sequence start with 10xxxxxx instead, thus gaining 1 bit, all the way up to 22 bits with a 4-byte UTF-8 sequence? The way it is right now, 64 possible values are lost (from 10000000 to 10111111). I'm not trying to argue with the standard, but I'm wondering why this is so.

** EDIT **

Or even, why isn't it

UTF-8 binary representation         Meaning
0xxxxxxx                            1 byte for 1 to 7 bit chars
110xxxxx xxxxxxxx                   2 bytes for 8 to 13 bit chars
1110xxxx xxxxxxxx xxxxxxxx          3 bytes for 14 to 20 bit chars
11110xxx xxxxxxxx xxxxxxxx xxxxxxxx 4 bytes for 21 to 27 bit chars

...?
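
Just to be clear about what I mean, here is a sketch of that hypothetical variant (not real UTF-8, only the scheme in the table above, where continuation bytes would carry all 8 bits as payload):

def encode_hypothetical(code_point: int) -> bytes:
    """Sketch of the non-standard variant I'm asking about."""
    if code_point < 0x80:                      # 7 bits  -> 0xxxxxxx
        return bytes([code_point])
    elif code_point < (1 << 13):               # 13 bits -> 110xxxxx xxxxxxxx
        return bytes([0xC0 | (code_point >> 8),
                      code_point & 0xFF])
    elif code_point < (1 << 20):               # 20 bits -> 1110xxxx xxxxxxxx xxxxxxxx
        return bytes([0xE0 | (code_point >> 16),
                      (code_point >> 8) & 0xFF,
                      code_point & 0xFF])
    elif code_point < (1 << 27):               # 27 bits -> 11110xxx xxxxxxxx xxxxxxxx xxxxxxxx
        return bytes([0xF0 | (code_point >> 24),
                      (code_point >> 16) & 0xFF,
                      (code_point >> 8) & 0xFF,
                      code_point & 0xFF])
    raise ValueError("too many bits for this scheme")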

Thanks!
