Hacker News new | ask | show | jobs
by sparkie 1105 days ago
Codepoints themselves could technically all fit into 3 bytes (or 21 bits to be precise), but there is no standard 3-byte encoding. The highest Unicode codepoint is 0x10FFFF.

An idea for a variable width encoding of 1 to 3 bytes: Read the MSB of each byte: If it's 0, don't read any more bytes. If it's 1, read the next byte. Do the same (up to 3 times). The non MSB bits of each byte then make up the codepoint.

    0xxxxxxx                    (ASCII)
    1xxxxxxx 0xxxxxxx           (0x0080 - 0x3FFF)
    1xxxxxxx 1xxxxxxx 0xxxxxxx  (0x4000 - 0x1FFFFF)
If the Unicode range grew in future to require further bits you could use the same technique by allowing greater than 3 bytes.

    1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx (0x200000 - 0x10000000)
The obvious drawback to this approach is that it is inherently serial. You need to read each byte before considering the next, so it would perform worse than UTF-8 in most cases.

Another drawback is that it is not self-synchronizing, which is one of the benefits of UTF-8.

It also has the issue that you can represent some codepoints with more than one encoding: eg, put ASCII characters into 2 or 3 bytes. So you would need rules to use the minimal encoding for each codepoint.

As a space-saving technique, it may offer better density than UTF-8 or UTF-16 on some texts.

You could also use a fixed-width encoding of 24-bits to avoid the problem of reading it serially, but as computers typically work in powers of 2, you would align 24-bit values at 32-bit addresses and load them into 32-bit+ registers, so there's nothing to really gain in terms of performance here over UTF-32, but you could save a bit of space.

1 comments

Isn't this just utf8 without information on how many of the following bytes belong to the glyph?
It's more compact than UTF-8 because fewer bits are used for the encoding itself. There are 24 bits, and only 3 bits are used as part of the encoding, with the other 21 bits representing the code-point. (Precisely the number we need to represent all Unicode).

In UTF-8, a 3-byte encoding uses 8-bits as part of the encoding, a full byte worth of bits for the encoding itself, leaving only 16-bits for the codepoint. If you need higher code-points you need to use 4 bytes, where 11 bytes are the encoding and 21 bytes are the codepoint.

So UTF-8 is space efficient for ASCII, but ~1/3 of the bits are used for the encoding in for non-ASCII, versus a fixed 1/8 of the bits used for the encoding above for all 1-3 bytes. The above has a fixed 12.5% space overhead over raw codepoints. UTF-8 has 12.5% only for ASCII, and ~33% overhead for everything else.

Although it is not self-synchronizing like UTF-8, you can synchronize a reliable stream by holding a buffer of the previous byte. If the previous byte's value is >=0x80, the current byte is part of the same character. If it's <0x80, the current byte is the start of a new character, so it's still possible to do substring matching etc, fairly efficiently but slightly less efficiently than UTF-8. It makes it suitable for file storage, but not ideal for transmission.

That said, most sane protocols will prefix a string with a length (in bytes), so self-synchronization is not always an issue.