Would it be possible to have a UTF-8-like encoding limited to 3 bytes per character?

Posted by dan04 on Stack Overflow, 2010-06-10

UTF-8 requires 4 bytes to represent characters outside the BMP. That's not bad; it's no worse than UTF-16 or UTF-32. But it's not optimal (in terms of storage space).
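
For concreteness, here's how the byte counts fall out of the standard code-point ranges; this is just an illustration in Python, and the helper name is mine:

    def utf8_len(cp):
        """Bytes that standard UTF-8 needs for code point cp."""
        if cp < 0x80:
            return 1        # ASCII
        elif cp < 0x800:
            return 2
        elif cp < 0x10000:
            return 3        # rest of the BMP
        else:
            return 4        # supplementary planes, i.e. outside the BMP

    # Sanity check against Python's own encoder:
    assert utf8_len(0x1D11E) == len(chr(0x1D11E).encode('utf-8')) == 4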

There are 13 byte values (C0-C1 and F5-FF) that are never used, and there are multi-byte sequences that are never used either, such as the ones corresponding to "overlong" encodings. If these had been available to encode characters, then more of them could have been represented by 2-byte or 3-byte sequences (of course, at the expense of making the implementation more complex).
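
For example, Python's strict UTF-8 decoder rejects both kinds of unused sequence (just an illustration; the byte values themselves are the standard ones):

    # C0/C1 could only start "overlong" encodings of U+0000-U+007F,
    # so a conforming decoder rejects them outright:
    try:
        bytes([0xC0, 0x80]).decode('utf-8')         # overlong encoding of U+0000
    except UnicodeDecodeError as e:
        print(e)                                    # ... invalid start byte

    # F5-FF would lead sequences for code points above U+10FFFF,
    # which Unicode forbids:
    try:
        bytes([0xF5, 0x80, 0x80, 0x80]).decode('utf-8')
    except UnicodeDecodeError as e:
        print(e)                                    # ... invalid start byte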

Would it be possible to represent all 1,114,112 Unicode code points by a UTF-8-like encoding with at most 3 bytes per character? If not, what is the maximum number of characters such an encoding could represent?

By "UTF-8-like", I mean, at minimum:

  • The bytes 0x00-0x7F are reserved for ASCII characters.
  • Byte-oriented find / index functions work correctly: a search can't produce a false positive by matching starting in the middle of a character, the way it can in Shift-JIS.
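
Here's my own back-of-envelope attempt, assuming the discipline UTF-8 itself uses to get the search property above: lead bytes and continuation bytes are drawn from disjoint subsets of 0x80-0xFF (that's sufficient for the property, though perhaps not the only way to get it). The enumeration below is just that sketch, not a proof:

    # Split the 128 non-ASCII byte values into c continuation values,
    # l2 lead values for 2-byte sequences and l3 lead values for 3-byte
    # sequences.  Such a scheme covers 128 + l2*c + l3*c*c code points.
    best = max(
        (128 + l2*c + l3*c*c, c, l2, l3)
        for c in range(129)
        for l2 in range(129 - c)
        for l3 in [128 - c - l2]
    )
    print(best)   # (310803, 85, 0, 43): far short of 1,114,112

If that disjointness really is forced by the search requirement, the ceiling would be around 310,000 characters; I'd be interested to know whether a cleverer scheme can do better.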
