by rcoveson 1099 days ago
My question is about the phrase "up to 5". What in Unicode is up to 5? Codepoints are up to 4 bytes in all the encodings I know. ZWJ sequences may as well be arbitrarily long. What is "up to 5"?

The original quote for reference:

> So one Unicode character can be up to 5 bytes long and take up the same canvas space as 3 characters.

FWIW, I didn't read that as suggesting an upper bound of 5 bytes, but rather as an example using arbitrary numbers: N bytes of code units could, depending on the font providing the glyph(s) for the respective grapheme(s), be rendered at M times the size of, say, the letter A, where N != M -- despite the font otherwise being monospaced. Which is just another way of saying that you must consult the font for the character widths involved.

I think you're reading that quote as an assertion that:

    For any grapheme G, G can be encoded in at most 5 bytes.
While what I think was being said was:

    There exists a grapheme G, where G is encoded in 5 bytes, and the respective glyph happens to be displayed at 3 times a single character (e.g. the letter A), despite the font otherwise being monospaced. Therefore you *must* consult the font for each glyph to correctly determine character widths.
Continue to the next sentence of context:

> You also need to read ahead as there are combination characters, for example a smiley combined with the color brow becomes a brown smiley.

Emphasis mine. Clearly combination characters are being treated separately.

Frankly I think it's crazy to read "up to 5 bytes" and not think that it suggests an upper bound. I think you're reaching for a highly questionable interpretation of a totally unambiguous clause. If the author meant to express what you're saying, they would certainly have written: "Some Unicode characters are 5 bytes long and take up the same canvas space as 3 characters". Which would still look incorrect if they followed it with the sentence "You also need to read ahead as there are combination characters...".

It is far more likely that the author is simply mistaken and should have said 4 bytes, and perhaps used the word "codepoint" instead of "character" in the original sentence. That's a perfectly understandable technical error, while the reinterpretation you're putting together would imply an error of colloquial language.

Codepoints themselves could technically all fit into 3 bytes (or 21 bits to be precise), but there is no standard 3-byte encoding. The highest Unicode codepoint is 0x10FFFF.

An idea for a variable width encoding of 1 to 3 bytes: Read the MSB of each byte: If it's 0, don't read any more bytes. If it's 1, read the next byte. Do the same (up to 3 times). The non MSB bits of each byte then make up the codepoint.

    0xxxxxxx                    (ASCII)
    1xxxxxxx 0xxxxxxx           (0x0080 - 0x3FFF)
    1xxxxxxx 1xxxxxxx 0xxxxxxx  (0x4000 - 0x1FFFFF)
If the Unicode range grew in future to require further bits you could use the same technique by allowing greater than 3 bytes.

    1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx (0x200000 - 0xFFFFFFF)
The obvious drawback to this approach is that it is inherently serial. You need to read each byte before considering the next, so it would perform worse than UTF-8 in most cases.

Another drawback is that it is not self-synchronizing, which is one of the benefits of UTF-8.

It also has the issue that you can represent some codepoints with more than one encoding: e.g., pad ASCII characters out to 2 or 3 bytes. So you would need rules to use the minimal encoding for each codepoint.
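
A minimal sketch of the scheme described above, to make the continuation-bit and minimal-encoding rules concrete. The function names are made up for illustration, and the byte order (most significant 7-bit group first) is an assumption, since the description does not specify one:

```python
from typing import List

MAX_CODEPOINT = 0x10FFFF  # highest Unicode codepoint
MAX_BYTES = 3             # 3 bytes * 7 payload bits = 21 bits


def encode_codepoint(cp: int) -> bytes:
    """Encode one codepoint with the fewest 7-bit groups (minimal form)."""
    if not 0 <= cp <= MAX_CODEPOINT:
        raise ValueError("codepoint out of range")
    groups = [cp & 0x7F]
    cp >>= 7
    while cp:
        groups.append(cp & 0x7F)
        cp >>= 7
    groups.reverse()  # assumption: most significant group first
    # MSB = 1 on every byte except the last one.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def decode(data: bytes) -> List[int]:
    """Decode a byte string, rejecting overlong and truncated sequences."""
    out = []
    value = 0
    length = 0
    for b in data:
        value = (value << 7) | (b & 0x7F)
        length += 1
        if b & 0x80:  # MSB set: another byte follows
            if length == MAX_BYTES:
                raise ValueError("sequence longer than 3 bytes")
            continue
        # MSB clear: this byte ends the character.
        if length > 1 and value < (1 << (7 * (length - 1))):
            raise ValueError("overlong (non-minimal) encoding")
        if value > MAX_CODEPOINT:
            raise ValueError("codepoint out of range")
        out.append(value)
        value = 0
        length = 0
    if length:
        raise ValueError("truncated sequence")
    return out
```

Note that the decoder has to look at every byte in order, which is the serial behaviour mentioned above, and that the overlong check is what enforces "one encoding per codepoint".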

As a space-saving technique, it may offer better density than UTF-8 or UTF-16 on some texts.

You could also use a fixed-width encoding of 24-bits to avoid the problem of reading it serially, but as computers typically work in powers of 2, you would align 24-bit values at 32-bit addresses and load them into 32-bit+ registers, so there's nothing to really gain in terms of performance here over UTF-32, but you could save a bit of space.

Isn't this just utf8 without information on how many of the following bytes belong to the glyph?
It's more compact than UTF-8 because fewer bits are used for the encoding itself. There are 24 bits, and only 3 bits are used as part of the encoding, with the other 21 bits representing the code-point. (Precisely the number we need to represent all Unicode).

In UTF-8, a 3-byte encoding uses 8 bits as part of the encoding, a full byte's worth of bits for the encoding itself, leaving only 16 bits for the codepoint. If you need higher codepoints you need to use 4 bytes, where 11 bits are the encoding and 21 bits are the codepoint.

So UTF-8 is space efficient for ASCII, but ~1/3 of the bits are used for the encoding for non-ASCII, versus a fixed 1/8 of the bits used for the encoding above for all 1-3 bytes. The above has a fixed 12.5% space overhead over raw codepoints. UTF-8 has 12.5% only for ASCII, and ~33% overhead for everything else.
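
The arithmetic behind those percentages can be checked with a few lines. This is only a sketch: the table names are made up, and "overhead" here means the fraction of bits in a sequence that are not codepoint payload:

```python
from fractions import Fraction

# (total bits, payload bits) per sequence length.
# UTF-8 layouts: 0xxxxxxx, 110xxxxx 10xxxxxx, 1110xxxx 10xxxxxx 10xxxxxx,
# 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
UTF8_BITS = {1: (8, 7), 2: (16, 11), 3: (24, 16), 4: (32, 21)}

# The flag-bit scheme spends exactly one bit per byte on structure.
FLAG_BITS = {n: (8 * n, 7 * n) for n in (1, 2, 3)}


def overhead(total_bits: int, payload_bits: int) -> Fraction:
    """Fraction of the sequence's bits that are not codepoint payload."""
    return Fraction(total_bits - payload_bits, total_bits)


if __name__ == "__main__":
    for name, table in (("UTF-8", UTF8_BITS), ("flag-bit", FLAG_BITS)):
        for n, (total, payload) in table.items():
            frac = overhead(total, payload)
            print(f"{name:8} {n} byte(s): {frac} = {float(frac):.1%}")
```

For UTF-8 the fractions come out as 1/8, 5/16, 1/3 and 11/32 for 1 to 4 bytes; for the flag-bit scheme it is 1/8 at every length.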

Although it is not self-synchronizing like UTF-8, you can synchronize a reliable stream by holding a buffer of the previous byte. If the previous byte's value is >=0x80, the current byte is part of the same character. If it's <0x80, the current byte is the start of a new character, so it's still possible to do substring matching etc, fairly efficiently but slightly less efficiently than UTF-8. It makes it suitable for file storage, but not ideal for transmission.
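
A rough sketch of that previous-byte rule, with hypothetical helper names. It assumes the data is a well-formed stream in the flag-bit encoding and that the needle is itself a complete encoded string:

```python
def is_char_start(data: bytes, i: int) -> bool:
    """Byte i starts a character iff the byte before it has its MSB clear."""
    return i == 0 or data[i - 1] < 0x80


def char_start(data: bytes, i: int) -> int:
    """Walk back from an arbitrary index to the start of its character."""
    while not is_char_start(data, i):
        i -= 1
    return i


def find_encoded(haystack: bytes, needle: bytes) -> int:
    """Like bytes.find, but only accepts matches on a character boundary."""
    i = haystack.find(needle)
    while i != -1 and not is_char_start(haystack, i):
        i = haystack.find(needle, i + 1)
    return i
```

The boundary check matters for searching because the last byte of a multi-byte character looks like ASCII. For example, U+20AC would encode as `C1 2C` under the most-significant-group-first assumption, and `2C` on its own is a comma, so a plain byte search for a comma would produce a false match that `find_encoded` skips.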

That said, most sane protocols will prefix a string with a length (in bytes), so self-synchronization is not always an issue.

I read this as colloquial English for "around" or "approximately". Not setting a bound limit, but setting an example size.
But you also read it as referring to ZWJ sequences? So the author has picked a number that is actually below average and they've worded it as up to...?

Saying a ZWJ sequence can be "up to 5 bytes" is like saying "the current generation of Intel processors run at clock speeds of up to 2 GHz".

If they were referring to ZWJ sequences (I don't think they were; I think they were just misremembering the maximum encoded length of a codepoint) and they had said "up to 35 bytes", then I might agree with you. It's still not technically accurate, but it's a reasonable colloquial usage, like saying "human males can grow up to seven feet tall".
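
For a sense of scale, here is a quick check of how many bytes one common ZWJ sequence takes. The family emoji below is just one example chosen for illustration; other sequences are shorter or longer:

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# man + ZWJ + woman + ZWJ + girl + ZWJ + boy
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])

codepoints = len(family)                   # 4 emoji + 3 joiners
utf8_len = len(family.encode("utf-8"))     # 4 * 4 bytes + 3 * 3 bytes
utf16_len = len(family.encode("utf-16-le"))  # 4 * 4 bytes + 3 * 2 bytes

if __name__ == "__main__":
    print(codepoints, utf8_len, utf16_len)
```

That is 7 codepoints, 25 bytes in UTF-8 and 22 bytes in UTF-16, for something most renderers draw as a single glyph.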

I think you are trying to read something that wasn't meant to be technical documentation as if it was trying to be exact technical specifications. I'm not the original author, so I don't have reason to litigate this any further, and I'm not sure what you are arguing about at this point.
You've now replied to me up to 2 times.
Sorry, I meant codepoints/characters, but it would not surprise me if there existed an encoding or language where my wording would be technically correct, though I do not know of any such encoding. I also did not know that there exist more than 5 combinations in Unicode, but I'm not surprised, and my implementation is probably buggy. But I do challenge you to test how well your favourite editor (terminal emulator, cough) handles Unicode emojis.
UTF-8's original specification included 5-byte and 6-byte encodings to cover the complete astral plane (31-bit code points), but later specifications marked those "invalid" because of the current 21-bit limit of UTF-16, aligning both specifications for now rather than fixing the bugs in UTF-16 (or scrapping UTF-16 altogether). In theory, UTF-8 can even extend beyond 6-byte encodings (and UTF-32 into 8-byte encodings and beyond) if the next plane (63-bit code points) or the one after that ever needed to open up. (No one expects that any time soon, of course. Today Unicode is nowhere close to filling 21 bits, much less 31. That would be a massive shock, and the compatibility headache would be terrible: UTF-16 would break, and so would today's software that hard-codes the assumption that UTF-8 never goes past 4-byte encodings.)
If it wouldn't surprise you then I think you should recalibrate your feelings about how surprising Unicode encodings are. There aren't very many of them, they haven't changed in a very long time, and they don't deal with any of the stuff that makes Unicode very complicated (collation, combination characters, etc). They just encode 21-bit integers, albeit sometimes in a highly convoluted way for backwards-compatibility reasons (UTF-16). It's not the kind of thing that needs to be estimated, or where a layer of FUD is warranted (as it kind of is with combination characters). When talking of codepoints, it's just "up to 4 bytes", high confidence, nothing more to it.
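
The "up to 4 bytes" claim is easy to spot-check with Python's built-in codecs. A small sketch, where the sample codepoints are arbitrary picks at the boundaries between encoded lengths:

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")  # -le variants avoid a BOM

# Boundary codepoints: last 1-, 2-, 3-byte UTF-8 values, the first
# supplementary-plane codepoint, and the highest codepoint.
SAMPLES = [0x7F, 0x7FF, 0xFFFF, 0x10000, 0x10FFFF]


def encoded_lengths(cp: int) -> dict:
    """Bytes needed for one codepoint in each standard encoding."""
    ch = chr(cp)
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


if __name__ == "__main__":
    for cp in SAMPLES:
        print(f"U+{cp:06X}", encoded_lengths(cp))
```

None of the samples needs more than 4 bytes in any of the three encodings, including U+10FFFF.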