by netvl
3392 days ago
> I've never seen people using UTF-8 deal with a code unit stage. They parse directly from bytes to code points.

Well, that's probably because in UTF-8 the code unit is a byte :) Quoting https://en.wikipedia.org/wiki/UTF-8:

> The encoding is variable-length and uses 8-bit code units.

By definition, a code unit is a bit sequence of fixed size from which code points are formed. In UTF-8 you form code points from 8-bit bytes, so in UTF-8 the code unit is a byte. In UTF-16 it is a sequence of two bytes; in UTF-32 it is a sequence of four bytes.
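To make the definition concrete, here is a rough Python sketch that splits an encoded string into code units for each of the three encodings. The helper name `code_units` is made up for illustration, and it uses the big-endian codecs only so that no byte order mark shows up in the output.

```python
def code_units(text: str, encoding: str) -> list[int]:
    """Return the code units of `text` in utf-8, utf-16 or utf-32 as integers."""
    # Code unit size in bytes for each encoding form.
    unit_size = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[encoding]
    # Explicit big-endian codecs avoid a BOM; UTF-8 has no byte order.
    codec = encoding if unit_size == 1 else encoding + "-be"
    data = text.encode(codec)
    # Group the bytes into fixed-size code units.
    return [
        int.from_bytes(data[i:i + unit_size], "big")
        for i in range(0, len(data), unit_size)
    ]


for enc in ("utf-8", "utf-16", "utf-32"):
    print(enc, [hex(u) for u in code_units("\u20ac", enc)])
```

For U+20AC this gives three 8-bit units in UTF-8 and a single unit in both UTF-16 and UTF-32; in UTF-8 the list of code units is just the list of bytes.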
Code units may 'exist' in all three encodings by fiat of their definition, but only in UTF-16 do they have a visible function and require you to process an additional layer.
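As a sketch of what that additional layer looks like, assuming the bytes have already been grouped into 16-bit units: the hypothetical helper below (`utf16_units_to_code_points` is an illustrative name, not a library function) combines surrogate pairs into code points. It passes lone surrogates through unchanged instead of raising an error, which a strict decoder would not do.

```python
def utf16_units_to_code_points(units: list[int]) -> list[int]:
    """Combine 16-bit UTF-16 code units into Unicode code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Surrogate pair: 10 bits from each unit, offset by 0x10000.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code point (or a lone surrogate, passed through as-is).
            code_points.append(unit)
            i += 1
    return code_points


print([hex(cp) for cp in utf16_units_to_code_points([0x0041, 0xD83D, 0xDE00])])
```

The bytes-to-units grouping is the step that has no separate counterpart in UTF-8, where each byte already is a code unit.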