
by sparkie 1105 days ago

Codepoints themselves could technically all fit into 3 bytes (or 21 bits to be precise), but there is no standard 3-byte encoding. The highest Unicode codepoint is 0x10FFFF.

An idea for a variable-width encoding of 1 to 3 bytes: read the MSB of each byte. If it's 0, stop; that byte is the last one of the character. If it's 1, read the next byte as well, up to 3 bytes in total. The non-MSB bits (the low 7 bits) of each byte, concatenated, then make up the codepoint.

    0xxxxxxx                    (ASCII)
    1xxxxxxx 0xxxxxxx           (0x0080 - 0x3FFF)
    1xxxxxxx 1xxxxxxx 0xxxxxxx  (0x4000 - 0x1FFFFF)
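
A rough Python sketch of the scheme above, for illustration only. The byte order is my assumption (most significant 7-bit group first), since the description doesn't pin it down, and the names `encode` and `decode` are made up here.

```python
MAX_CODEPOINT = 0x10FFFF

def encode(cp: int) -> bytes:
    """Encode one codepoint into 1 to 3 bytes, most significant group first (assumed)."""
    if not 0 <= cp <= MAX_CODEPOINT:
        raise ValueError("codepoint out of range")
    groups = [cp & 0x7F]                   # last byte: MSB clear
    cp >>= 7
    while cp:
        groups.append(0x80 | (cp & 0x7F))  # earlier bytes: MSB set
        cp >>= 7
    return bytes(reversed(groups))

def decode(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (codepoint, position of the next character), reading at most 3 bytes."""
    cp = 0
    for i in range(3):
        byte = data[pos + i]               # IndexError if the input is truncated
        cp = (cp << 7) | (byte & 0x7F)
        if byte < 0x80:                    # MSB clear: this was the last byte
            return cp, pos + i + 1
    raise ValueError("sequence longer than 3 bytes")

print(encode(0x41).hex(), encode(0x80).hex(), encode(0x10FFFF).hex())
```

With this byte order, 0x41 stays a single byte, 0x80 becomes `81 00`, and 0x10FFFF becomes `c3 ff 7f`.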

If the Unicode range grew in future to require further bits, you could use the same technique by allowing more than 3 bytes. Four bytes would carry 28 bits:

    1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx (0x200000 - 0xFFFFFFF)

The obvious drawback to this approach is that it is inherently serial. You need to read each byte before you know whether to consider the next, whereas in UTF-8 the first byte alone tells you the length of the whole sequence. So it would perform worse than UTF-8 in most cases.
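
For comparison, here is roughly how a UTF-8 decoder gets the whole sequence length from the first byte. This is a sketch that only looks at the bit pattern; it does not reject lead bytes that strict UTF-8 forbids (0xC0, 0xC1, 0xF5 and up), and the function name is just for illustration.

```python
def utf8_sequence_length(lead: int) -> int:
    """Length of a UTF-8 sequence, judged from the lead byte's bit pattern only."""
    if lead < 0x80:            # 0xxxxxxx
        return 1
    if 0xC0 <= lead < 0xE0:    # 110xxxxx
        return 2
    if 0xE0 <= lead < 0xF0:    # 1110xxxx
        return 3
    if 0xF0 <= lead < 0xF8:    # 11110xxx
        return 4
    raise ValueError("not a UTF-8 lead byte")

euro = "\u20ac".encode("utf-8")   # e2 82 ac
print(utf8_sequence_length(euro[0]), len(euro))
```

In the scheme above there is no equivalent: the length is only known once a byte with a clear MSB has been reached.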

Another drawback is that it is not self-synchronizing, which is one of the benefits of UTF-8.

It also has the issue that you can represent some codepoints with more than one encoding: e.g., you can pad ASCII characters out to 2 or 3 bytes. So you would need a rule requiring the minimal encoding for each codepoint, as UTF-8 does when it forbids overlong forms.
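
A small sketch of what such a rule could look like. It assumes the most significant 7-bit group comes first (my assumption, the scheme as described doesn't fix the order), in which case a padded encoding is exactly one that starts with the byte 0x80. `payload` and `is_minimal` are hypothetical helper names.

```python
def payload(seq: bytes) -> int:
    """Concatenate the low 7 bits of each byte of one complete character."""
    cp = 0
    for byte in seq:
        cp = (cp << 7) | (byte & 0x7F)
    return cp

def is_minimal(seq: bytes) -> bool:
    """A multi-byte sequence is padded if its first byte carries only zero bits."""
    return len(seq) == 1 or seq[0] != 0x80

# 'A' as 1, 2 and 3 bytes: all three carry the same codepoint.
for seq in (b"\x41", b"\x80\x41", b"\x80\x80\x41"):
    print(seq.hex(), hex(payload(seq)), is_minimal(seq))
```

A decoder would reject anything for which `is_minimal` is false, so every codepoint has exactly one valid encoding.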

As a space-saving technique, it may offer better density than UTF-8 or UTF-16 on some texts.
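
To get a feel for "some texts", here is a quick size comparison. `scheme_len` is a made-up helper that just counts bytes per codepoint from the ranges above; the UTF-8 and UTF-16 sizes come from Python's own codecs. The sample strings are arbitrary.

```python
def scheme_len(cp: int) -> int:
    """Bytes per codepoint under the 7-bits-per-byte scheme."""
    if cp < 0x80:
        return 1
    if cp < 0x4000:
        return 2
    return 3

def sizes(text: str) -> dict[str, int]:
    """Encoded size in bytes of the same text under each encoding."""
    return {
        "scheme": sum(scheme_len(ord(ch)) for ch in text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

samples = {
    "ascii": "hello",
    "devanagari": "\u0928\u092e\u0938\u094d\u0924\u0947",
    "cjk": "\u65e5\u672c\u8a9e",
    "emoji": "\U0001F600",
}
for name, text in samples.items():
    print(name, sizes(text))
```

The gain over UTF-8 is in the range 0x0800 - 0x3FFF (2 bytes instead of 3) and above 0xFFFF (3 instead of 4). Most CJK ideographs sit above 0x4000, so they take 3 bytes either way and UTF-16 stays smaller there.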

You could also use a fixed-width 24-bit encoding to avoid the problem of reading it serially. But computers typically work in powers of 2: in memory you would align the 24-bit values on 32-bit boundaries and load them into 32-bit or wider registers, so there is nothing to gain in performance over UTF-32. Stored packed, though, it would save a quarter of the space (3 bytes per codepoint instead of 4).
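
A sketch of the packed variant, assuming big-endian byte order within each 3-byte unit (an arbitrary choice on my part). `pack24` and `codepoint_at` are hypothetical names.

```python
def pack24(text: str) -> bytes:
    """Store every codepoint in exactly 3 bytes, big-endian (assumed order)."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)

def codepoint_at(buf: bytes, index: int) -> int:
    """Random access: the index-th codepoint starts at byte offset 3 * index."""
    start = 3 * index
    return int.from_bytes(buf[start:start + 3], "big")

buf = pack24("A\U0001F600")
print(len(buf), hex(codepoint_at(buf, 1)))
```

Indexing is constant-time as with UTF-32, at the cost of unaligned 3-byte reads.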

1 comments

ablob 1103 days ago

Isn't this just UTF-8 without the information on how many of the following bytes belong to the glyph?

sparkie 1103 days ago

It's more compact than UTF-8 because fewer bits are used for the encoding itself. There are 24 bits, and only 3 bits are used as part of the encoding, with the other 21 bits representing the code-point. (Precisely the number we need to represent all Unicode).

In UTF-8, a 3-byte encoding uses 8 bits as part of the encoding, a full byte's worth of bits for the encoding itself, leaving only 16 bits for the codepoint. If you need higher codepoints you need to use 4 bytes, where 11 bits are the encoding and 21 bits are the codepoint.

So UTF-8 is space efficient for ASCII, but ~1/3 of the bits are used for the encoding itself for non-ASCII (5 of 16, 8 of 24, or 11 of 32), versus a fixed 1/8 of the bits for all 1-3 byte lengths in the scheme above. In other words, the scheme above spends a fixed 12.5% of its bits on overhead. UTF-8 spends 12.5% only for ASCII, and ~33% for everything else.
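
The arithmetic, spelled out. The UTF-8 payload sizes (7, 11, 16 and 21 bits for 1 to 4 bytes) are the standard ones; the function names are just for illustration.

```python
# Codepoint bits carried by a UTF-8 sequence of each length.
UTF8_PAYLOAD_BITS = {1: 7, 2: 11, 3: 16, 4: 21}

def utf8_overhead(n: int) -> float:
    """Fraction of an n-byte UTF-8 sequence spent on framing bits."""
    total = 8 * n
    return (total - UTF8_PAYLOAD_BITS[n]) / total

def scheme_overhead(n: int) -> float:
    """One flag bit per byte, whatever the length."""
    return n / (8 * n)

for n in (1, 2, 3, 4):
    print(f"UTF-8  {n} byte(s): {utf8_overhead(n):.1%}")
for n in (1, 2, 3):
    print(f"scheme {n} byte(s): {scheme_overhead(n):.1%}")
```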

Although it is not self-synchronizing like UTF-8, you can synchronize a reliable stream by holding a buffer of the previous byte. If the previous byte's value is >=0x80, the current byte is part of the same character. If it's <0x80, the current byte is the start of a new character, so it's still possible to do substring matching etc, fairly efficiently but slightly less efficiently than UTF-8. It makes it suitable for file storage, but not ideal for transmission.
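
The look-behind rule as a sketch: given any byte offset, walk back while the previous byte has its MSB set. `char_start` is a made-up name, and the sample buffer assumes the most-significant-group-first byte order.

```python
def char_start(data: bytes, i: int) -> int:
    """Index of the first byte of the character that contains data[i]."""
    while i > 0 and data[i - 1] >= 0x80:   # previous byte continues into this one
        i -= 1
    return i

# Three characters of 1, 2 and 3 bytes: 0x41, 0x80, 0x10FFFF.
data = bytes([0x41, 0x81, 0x00, 0xC3, 0xFF, 0x7F])
print([char_start(data, i) for i in range(len(data))])
```

With at most 3 bytes per character this never steps back more than twice, but it does need the preceding bytes to be available and intact, which is why it works for a file and not for a stream joined mid-character.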

That said, most sane protocols will prefix a string with a length (in bytes), so self-synchronization is not always an issue.