by Flimm 4787 days ago

In Python 3, there is a clear distinction between string objects and bytes objects. A string object is a sequence of Unicode code points, and a bytes object is a sequence of integer byte values in the range 0-255.
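A quick sketch of that distinction, in case it helps. The sample text and variable names are just mine for illustration:

```python
# A str holds code points; a bytes object holds integers 0-255.
text = "na\u00efve caf\u00e9"      # "naïve café", written with escapes
data = text.encode("utf-8")        # serialize the code points to bytes

print(type(text).__name__, len(text))   # length counted in code points
print(type(data).__name__, len(data))   # length counted in bytes

# Indexing a str gives a one-character str; indexing bytes gives an int.
char = text[2]
byte = data[2]
print(repr(char), byte)

# Python 3 refuses to mix the two types implicitly.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print("implicit mixing allowed:", mixed_ok)
```

Here the string has 10 code points but its UTF-8 form has 12 bytes, because the two accented letters each take two bytes.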

> Character is a context in which we read integers and currently we don't use more than a couple hundred thousand of those.

I don't understand this sentence.

> UTF-8 is a data-compression technique taking advantage of the fact that smaller code points are used more often, and the largest ones (which require 5+ bytes, because any compression algorithm expands some inputs) are, currently, not standardized and effectively never used.

It's better to think of UTF-8 and UTF-32 as encodings: their role is to serialize strings into byte sequences and to deserialize byte sequences back into strings. Some encodings are more efficient than others in different cases, but that doesn't change their role, and you don't need to understand the efficiency trade-offs to avoid Unicode mistakes.
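To make the "same role, different efficiency" point concrete, here is a small sketch. The helper name `encoded_sizes` is hypothetical, and I've added UTF-16 to the comparison myself because it shows that no single encoding is always the smallest:

```python
def encoded_sizes(text, encodings=("utf-8", "utf-16-le", "utf-32-le")):
    """Return the byte length of `text` under each encoding.

    Every encoding here does the same job: str -> bytes and back again.
    """
    sizes = {}
    for name in encodings:
        data = text.encode(name)            # serialize
        assert data.decode(name) == text    # deserialize: lossless round trip
        sizes[name] = len(data)
    return sizes

# ASCII text and Japanese text ("\u65e5\u672c\u8a9e" is 日本語).
for sample in ("hello", "\u65e5\u672c\u8a9e"):
    print(repr(sample), encoded_sizes(sample))
```

For "hello", UTF-8 needs 5 bytes against 20 for UTF-32. For the three Japanese characters, UTF-8 needs 9 bytes while UTF-16 needs 6. In every case the decoded string is identical, which is the part that matters for correctness.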