| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by deathanatos 3387 days ago

A "code unit" exists in UTF-8 and UTF-32; they are not unique to UTF-16.[1] UTF-8's relationship with code points is approximately the same as UTF-16's, except that UTF-8 systems tend to understand code points better because if they didn't, things break a lot sooner, whereas they mostly work in UTF-16.

Your entire argument that graphemes are a poor way to deal with unicode seems to be that current programming languages don't use graphemes, instead dealing in a mix of code units or points. But the article here shows a number of cases where that doesn't break down, and the person you're responding to clearly points out that, for the cases covered in the article, graphemes are the way to go (and he's correct).

Graphemes aren't always the correct method (and I don't think your parent was advocating that), just like code units or code points aren't always the right way to count. It's highly dependent on the problem at hand. The bigger issue is that programming languages make the default something that's often wrong, when they probably ought to force the programmer to choose, and so, most code ends up buggy. Worse, some languages, like JavaScript, provide no tooling within their standard library for some of the various common ways of needing to deal with Unicode, such as code points.

[1]: http://unicode.org/glossary/#code_unit

2 comments

throwaway399 3387 days ago

How would you implement graphemes compatibility if you can have unlimited number of code points combine into a grapheme? Designing an efficient storage solution for such text seems like a nightmare.

Any time you define an upper limit, someone will come up with more emojis that will require larger number of code points per grapheme.

link

Dylan16807 3387 days ago

> A "code unit" exists in UTF-8 and UTF-32; they are not unique to UTF-16.

Technically yes. But they are only "exposed" in UTF-16. In UTF-32 code points and code units are the same size, so you only have to deal with code points. In UTF-8 you only have to deal with code points and bytes. UTF-16 is unique in having something that is neither code point nor byte but sits in between.

link

danbruc 3387 days ago

That is certainly true if you only look at the word sizes at different layers. But any implementation will at least logically start with a sequence of bytes, then turn them into code units according to the encoding scheme, group code units into minimal well-formed code unit subsequences according to the encoding form, and finally turn them into code points.

While different layers may use words of the same size, there are still differences, for example what is valid and what is not. While for example U+00D800 is a perfectly fine code point, the first high-surrogate, 0x0000D800 is not a valid UTF-32 code unit. 0xC0 0xA0 is a perfectly fine pair of bytes, both are valid UTF-8 code units, and they could become the code point U+000020 if only 0xC0 0xA0 were not an invalid code unit subsequence.

So yes, while I agree that UTF-16 is special in that sense that one has to deal with 8, 16 and 32 bit words, I don't think that one should dismiss the concept of code units for all encoding forms but UTF-16. There enough subtle details between the different layers so that the distinction is warranted. And that is actually something I really like about the Unicode standard, it is really precise and doesn't mix up things that are superficially the same.

link

Dylan16807 3387 days ago

> But any implementation will at least logically start with a sequence of bytes, then turn them into code units according to the encoding scheme, group code units into minimal well-formed code unit subsequences according to the encoding form, and finally turn them into code points.

Not at all. I've never seen people using UTF-8 deal with a code unit stage. They parse directly from bytes to code points.

> While for example U+00D800 is a perfectly fine code point, the first high-surrogate, 0x0000D800 is not a valid UTF-32 code unit.

I thought that was an invalid code point. Where would I look to see the difference? Nevertheless I would expect most code to make no distinction between the invalidity of 0x0000D800 and 0x44444444, except perhaps to give a better error message.

> 0xC0 0xA0 is a perfectly fine pair of bytes, both are valid UTF-8 code units, and they could become the code point U+000020 if only 0xC0 0xA0 were not an invalid code unit subsequence.

If you say that they're correct code units then at what point do you distinguish bytes and code units? In practice almost nobody decodes UTF-8 with an understanding of code units, neither by that name nor any other name. They simply see bytes that correctly encode code points, and bytes that don't.

Especially if you say that C0 is a valid code unit despite it not appearing in any valid UTF-8 sequences.

link

netvl 3387 days ago

> I've never seen people using UTF-8 deal with a code unit stage. They parse directly from bytes to code points.

Well, that's probably because in UTF-8 code unit is byte :)

Quoting https://en.wikipedia.org/wiki/UTF-8:

> The encoding is variable-length and uses 8-bit code units.

By definition, code unit is a bit sequence of a fixed size which can form code points. In UTF-8 you form code points using 8-bit bytes, therefore in UTF-8 code unit is byte. In UTF-16 it is a sequence of two bytes. In UTF-32 it is a sequence of four bytes.

link

Dylan16807 3387 days ago

I said as much in my first comment, yes. I'm not sure if I'm missing something in your comment?

Code units may 'exist' on all three through the fiat of their definition, but they only have a visible function and require you to process an additional layer in UTF-16.

link

burntsushi 3387 days ago

> I thought that was an invalid code point.

Surrogate codepoints are indeed valid codepoints. It's just that valid UTF-8 is not allowed to encode surrogate codepoints, so the space of codepoints supported by UTF-8 is actually a subset of all Unicode codepoints. This subset is known as the set of Unicode scalar values. ("All codepoints except for surrogates.")

link

Dylan16807 3387 days ago

Those points cannot be validly encoded in any format. I suppose you can argue that they are valid-but-unusable in an abstract sense, since the unicode standard does not actually label any code points as valid/invalid, but if you were going to label any code points as invalid then those would be in the group.

link

danbruc 3387 days ago

You are certainly correct that it is common to not pay too much attention to what things are called in the specification, especially if you want to create a fast implementation. Logically you will still go through all the layers even if you operate on only one physical representation.

My admittedly quite limited experience with Unicode is from trying to exploit implementation bugs. And with that focus it is quite natural to pay close attention to the different layers in the specification in order to see where optimized implementations might miss edge cases.

And I am generally a big fan of staying close to the word of standards, if it does not cause unacceptable performance issues, I would always prefer to stick with the terminology of the standard even if it means that there will be transformations that are just the identity.

The distinction between code points and scalar values will for example become relevant if you implement Unicode meta data. There you can query meta data about surrogate code points even if a parser should never produce those code points.

link