by michaelochurch 4795 days ago
If you think of a string as a sequence of code points (integers), you get a correct but inefficient model (UTF-32).

Character is a context in which we read integers and currently we don't use more than a couple hundred thousand of those.

UTF-8 is a data-compression technique taking advantage of the fact that smaller code points are used more often, and the largest ones (which require 5+ bytes, because any compression algorithm expands some inputs) are, currently, not standardized and effectively never used.

2 comments

In Python 3, there is a clear distinction between string objects and bytes objects. A string object is a sequence of Unicode code points, and a bytes object is a sequence of byte values in the range 0-255.
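A small sketch of that distinction, using only built-ins. The sample word is my own pick and not from the thread:

```python
# A str holds code points; a bytes object holds integers in the range 0-255.
text = "na\u00efve"            # "naïve", with U+00EF as a single code point
data = text.encode("utf-8")    # serialize the string to bytes

code_points = [ord(ch) for ch in text]   # one integer per code point
byte_values = list(data)                 # one integer per byte

print(type(text).__name__, len(text), code_points)
print(type(data).__name__, len(data), byte_values)
```

The string has 5 code points but the UTF-8 bytes have length 6, because U+00EF takes two bytes in UTF-8.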

> Character is a context in which we read integers and currently we don't use more than a couple hundred thousand of those.

I don't understand this sentence.

> UTF-8 is a data-compression technique taking advantage of the fact that smaller code points are used more often, and the largest ones (which require 5+ bytes, because any compression algorithm expands some inputs) are, currently, not standardized and effectively never used.

It's better to think of UTF-8 and UTF-32 as encodings: their role is to serialize strings into byte sequences and to deserialize byte sequences into strings. Some encodings are more efficient than others in different cases, but that doesn't change their role, and it's not necessary to understand this to avoid Unicode mistakes.
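To illustrate the serialize/deserialize view, here is a rough sketch with Python's built-in codecs. The sample string is just an assumption for the demo:

```python
text = "caf\u00e9"  # "café"; sample string chosen only for illustration

# Serialize the same string with two different encodings.
utf8_bytes = text.encode("utf-8")
utf32_bytes = text.encode("utf-32-le")   # little-endian, no byte order mark

# Deserialize each byte sequence with the matching codec.
roundtrip_utf8 = utf8_bytes.decode("utf-8")
roundtrip_utf32 = utf32_bytes.decode("utf-32-le")

print(len(utf8_bytes), len(utf32_bytes))
print(roundtrip_utf8 == text, roundtrip_utf32 == text)
```

The byte sequences differ in content and length, but both decode back to the same string, which is the only thing the rest of the program has to care about.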

There will never be more than 4 bytes per code point in UTF-8, because Unicode is restricted to 21 bits (code points stop at U+10FFFF). Remember that all UTFs have to be able to represent all of Unicode, and UTF-16 could not represent those “code points” where UTF-8 needs 5+ bytes.
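This is easy to check in Python 3 (3.3 or later). The helper name `utf8_length` below is made up for the example:

```python
import sys

MAX_CODE_POINT = 0x10FFFF  # highest code point Unicode allows

def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a single code point (hypothetical helper)."""
    return len(chr(code_point).encode("utf-8"))

# Code points on either side of each UTF-8 length boundary, plus the maximum.
boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, MAX_CODE_POINT]
lengths = [utf8_length(cp) for cp in boundaries]

print(sys.maxunicode == MAX_CODE_POINT)   # Python's str has the same upper limit
print(MAX_CODE_POINT.bit_length())        # number of bits needed for the maximum
print(lengths)
```

The lengths come out as 1, 2, 2, 3, 3, 4, 4, and `chr(0x110000)` raises `ValueError`, so there is nothing left that would need a fifth byte.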

Also, I wouldn't say that UTF-8 is a compression scheme. SCSU is one but has its own share of problems. UTF-8 just happens to preserve ASCII compatibility, which is an important property for Unix-like systems. Nothing more and nothing less. That it also happens to be more space-efficient for text that consists mostly of ASCII characters is merely a side effect of that.

From the standpoint of an English speaker, UTF-8 is effectively a (good) compression scheme for Unicode, as opposed to using 2 or more bytes for every character.

I guess if I were German or Spanish (to say nothing of Asian languages), it would be the opposite of compression :-)
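For what it's worth, here is a rough way to measure it. The sample phrases and the `encoded_sizes` helper are my own assumptions, and three short strings prove nothing about real corpora:

```python
# Sample phrases chosen only for illustration.
samples = {
    "English": "The quick brown fox",
    "German": "Gr\u00f6\u00dfe und L\u00e4nge",  # "Größe und Länge"
    "Japanese": "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8",  # "日本語のテキスト"
}

def encoded_sizes(text: str) -> dict:
    """Byte counts of one string under three encodings (hypothetical helper)."""
    return {
        "code points": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),   # -le variants omit the BOM
        "utf-32": len(text.encode("utf-32-le")),
    }

sizes = {language: encoded_sizes(text) for language, text in samples.items()}
for language, row in sizes.items():
    print(language, row)
```

On these samples the German phrase is still smaller in UTF-8 (18 bytes) than in UTF-16 (30 bytes), since only the umlauts and ß take two bytes. The Japanese phrase is where UTF-8 loses to UTF-16 (24 bytes against 16), though it still beats UTF-32 (32 bytes).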