|
|
|
|
|
by danbruc
3388 days ago
|
|
That is certainly true if you only look at the word sizes at different layers. But any implementation will at least logically start with a sequence of bytes, then turn them into code units according to the encoding scheme, group code units into minimal well-formed code unit subsequences according to the encoding form, and finally turn them into code points. While different layers may use words of the same size, there are still differences, for example what is valid and what is not. While for example U+00D800 is a perfectly fine code point, the first high-surrogate, 0x0000D800 is not a valid UTF-32 code unit. 0xC0 0xA0 is a perfectly fine pair of bytes, both are valid UTF-8 code units, and they could become the code point U+000020 if only 0xC0 0xA0 were not an invalid code unit subsequence. So yes, while I agree that UTF-16 is special in that sense that one has to deal with 8, 16 and 32 bit words, I don't think that one should dismiss the concept of code units for all encoding forms but UTF-16. There enough subtle details between the different layers so that the distinction is warranted. And that is actually something I really like about the Unicode standard, it is really precise and doesn't mix up things that are superficially the same. |
|
Not at all. I've never seen people using UTF-8 deal with a code unit stage. They parse directly from bytes to code points.
> While for example U+00D800 is a perfectly fine code point, the first high-surrogate, 0x0000D800 is not a valid UTF-32 code unit.
I thought that was an invalid code point. Where would I look to see the difference? Nevertheless I would expect most code to make no distinction between the invalidity of 0x0000D800 and 0x44444444, except perhaps to give a better error message.
> 0xC0 0xA0 is a perfectly fine pair of bytes, both are valid UTF-8 code units, and they could become the code point U+000020 if only 0xC0 0xA0 were not an invalid code unit subsequence.
If you say that they're correct code units then at what point do you distinguish bytes and code units? In practice almost nobody decodes UTF-8 with an understanding of code units, neither by that name nor any other name. They simply see bytes that correctly encode code points, and bytes that don't.
Especially if you say that C0 is a valid code unit despite it not appearing in any valid UTF-8 sequences.