Would it be possible to have a UTF-8-like encoding limited to 3 bytes per character?
Posted
by dan04
on Stack Overflow
Published on 2010-06-10T02:32:07Z
Tags: character-encoding, hypothetical
UTF-8 requires 4 bytes to represent characters outside the BMP. That's not bad; it's no worse than UTF-16 or UTF-32. But it's not optimal (in terms of storage space).
There are 13 byte values (C0-C1 and F5-FF) that are never used, and there are multi-byte sequences that are never used, such as those corresponding to "overlong" encodings. If these had been available to encode characters, then more characters could have been represented by 2-byte or 3-byte sequences (at the expense, of course, of a more complex implementation).
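A back-of-the-envelope check (a sketch of my own, not from the original post): if we keep UTF-8's structure, where the 128 non-ASCII byte values are split into disjoint classes of continuation bytes and lead bytes (which is what gives self-synchronization), we can brute-force every possible split and see how many code points a 3-byte-max scheme could cover:

```python
# Sketch: split the 128 non-ASCII byte values (0x80-0xFF) into continuation
# bytes, lead bytes for 2-byte sequences, and lead bytes for 3-byte sequences,
# keeping lead and continuation classes disjoint as in UTF-8.
best = 0
for cont in range(129):              # number of continuation byte values
    for lead2 in range(129 - cont):  # lead bytes for 2-byte sequences
        lead3 = 128 - cont - lead2   # lead bytes for 3-byte sequences
        # 128 ASCII + (2-byte seqs) + (3-byte seqs)
        total = 128 + lead2 * cont + lead3 * cont * cont
        best = max(best, total)

print(best)  # prints 310803
```

Under these assumptions the optimum (no 2-byte lead bytes, 43 three-byte lead bytes, 85 continuation bytes) tops out at 310,803 characters, well short of the 1,114,112 code points.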
Would it be possible to represent all 1,114,112 Unicode code points by a UTF-8-like encoding with at most 3 bytes per character? If not, what is the maximum number of characters such an encoding could represent?
By "UTF-8-like", I mean, at minimum:
- The bytes 0x00-0x7F are reserved for ASCII characters.
- Byte-oriented `find`/`index` functions work correctly. You can't get a false positive by starting in the middle of a character, like you can in Shift-JIS.
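To illustrate that second requirement (an illustrative sketch, not part of the original post): in Shift-JIS, the trailing byte of a two-byte character can fall in the ASCII range, so a naive byte search produces false positives, while in UTF-8 every byte of a multi-byte sequence has the high bit set:

```python
# Classic Shift-JIS pitfall: the second byte of "表" is 0x5C, which is the
# ASCII backslash, so a byte-level search "finds" a backslash mid-character.
sjis = "表".encode("shift_jis")   # b'\x95\x5c'
assert sjis.find(b"\\") == 1      # false positive inside the character

# In UTF-8, all bytes of a multi-byte sequence are >= 0x80, so an ASCII
# byte can never match inside another character.
utf8 = "表".encode("utf-8")       # b'\xe8\xa1\xa8'
assert utf8.find(b"\\") == -1
assert all(b >= 0x80 for b in utf8)
```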
© Stack Overflow or respective owner