by michaelochurch
4795 days ago
If you think of a string as a sequence of code points (integers), you get a correct but inefficient model (UTF-32). Character is a context in which we read integers and currently we don't use more than a couple hundred thousand of those. UTF-8 is a data-compression technique taking advantage of the fact that smaller code points are used more often, and the largest ones (which require 5+ bytes, because any compression algorithm expands some inputs) are, currently, not standardized and effectively never used.
> Character is a context in which we read integers and currently we don't use more than a couple hundred thousand of those.
I don't understand this sentence.
> UTF-8 is a data-compression technique taking advantage of the fact that smaller code points are used more often, and the largest ones (which require 5+ bytes, because any compression algorithm expands some inputs) are, currently, not standardized and effectively never used.
It's better to think of UTF-8 and UTF-32 as encodings: their role is to serialize strings into byte sequences and to deserialize byte sequences into strings. Some encodings are more efficient than others in different cases, but that doesn't change their role, and it's not necessary to understand this to avoid Unicode mistakes.
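
To make the "encodings serialize and deserialize" view concrete, here is a minimal Python sketch. The sample string and the variable names are my own, picked only for illustration. It round-trips one string through UTF-8 and UTF-32 and records how many bytes each encoding produces, then shows how the UTF-8 length of a single code point grows with its value.

```python
# Sample text written with escapes so the exact code points are unambiguous:
# "naive" with a diaeresis, three CJK characters, and one emoji.
text = "na\u00efve \u65e5\u672c\u8a9e \U0001F600"

# Serialize the same string with two encodings, then deserialize it again.
# "utf-32-le" is used so that no byte-order mark is added to the output.
sizes = {}
for encoding in ("utf-8", "utf-32-le"):
    data = text.encode(encoding)          # str -> bytes
    assert data.decode(encoding) == text  # bytes -> str gives the same string
    sizes[encoding] = len(data)

# UTF-8 is variable-length: larger code points take more bytes.
samples = ["A", "\u00e9", "\u65e5", "\U0001F600"]
utf8_lengths = {f"U+{ord(ch):04X}": len(ch.encode("utf-8")) for ch in samples}

print(len(text), sizes)
print(utf8_lengths)
```

Both encodings decode back to the identical string, which is the point: they differ in size, not in what they represent. UTF-32 always spends 4 bytes per code point, while UTF-8 spends between 1 and 4 bytes on the code points Unicode currently assigns.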