by netvl 3392 days ago
> I've never seen people using UTF-8 deal with a code unit stage. They parse directly from bytes to code points.

Well, that's probably because in UTF-8 the code unit is a byte :)

Quoting https://en.wikipedia.org/wiki/UTF-8:

> The encoding is variable-length and uses 8-bit code units.

By definition, a code unit is a fixed-size bit sequence from which code points are formed. In UTF-8 you form code points from 8-bit bytes, so in UTF-8 the code unit is a byte. In UTF-16 it is a sequence of two bytes (16 bits). In UTF-32 it is a sequence of four bytes (32 bits).
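To make the definition concrete, here is a minimal Python sketch that counts code units for the same text in each encoding form. The `code_units` helper and the sample string are illustrative and not from the thread; the little-endian codec names are used only so that Python does not prepend a byte order mark.

```python
# Size of one code unit, in bytes, for each encoding form.
UNIT_SIZE = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}

def code_units(text, encoding):
    """Return how many code units `text` occupies in `encoding`."""
    return len(text.encode(encoding)) // UNIT_SIZE[encoding]

sample = "a\u20ac\U0001F600"  # three code points: U+0061, U+20AC, U+1F600
counts = {enc: code_units(sample, enc) for enc in UNIT_SIZE}
print(counts)  # {'utf-8': 8, 'utf-16-le': 4, 'utf-32-le': 3}
```

Three code points take 8 code units in UTF-8 (1 + 3 + 4), 4 in UTF-16 (1 + 1 + 2), and 3 in UTF-32, where code units and code points coincide.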

1 comment

I said as much in my first comment, yes. I'm not sure if I'm missing something in your comment?

Code units may 'exist' in all three encodings by fiat of their definition, but only in UTF-16 do they have a visible function and require you to process an additional layer.
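A rough Python sketch of what that additional layer looks like in UTF-16: bytes are first grouped into 16-bit code units, and only then are surrogate pairs combined into code points. The function names are hypothetical, little-endian input of even length is assumed, and lone surrogates are simply passed through rather than rejected.

```python
import struct

def utf16le_code_units(data):
    """Stage 1: group little-endian bytes into 16-bit code units."""
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def code_points_from_utf16(units):
    """Stage 2: combine surrogate pairs into code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by low surrogate: one supplementary code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP code unit (or a lone surrogate, passed through unchanged).
            out.append(u)
            i += 1
    return out

units = utf16le_code_units("a\U0001F600".encode("utf-16-le"))
points = code_points_from_utf16(units)
print([hex(u) for u in units])   # ['0x61', '0xd83d', '0xde00']
print([hex(p) for p in points])  # ['0x61', '0x1f600']
```

In UTF-8 the first stage is a no-op because the code unit is the byte, and in UTF-32 the second stage is a no-op because each code unit is already a code point.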