by chrismorgan
34 days ago
> Unicode code points are 32 bit

21-bit, actually. It was supposed to be 32-bit, but UTF-16 caps out at 21-bit, so they lopped eleven bits of potential from Unicode (and UTF-8, so no more six-byte encoding).

> at some point before Unicode

No, in the early days of Unicode.

> run length encodes

Um… what? RLE is a data compression thing, UTF-16 has nothing to do with it.
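To make the 21-bit point concrete, here is a minimal Rust sketch. The helper names `bits_needed` and `utf16_units` are made up for illustration; `char::MAX`, `leading_zeros` and `encode_utf16` are standard library. A surrogate pair carries 10 + 10 payload bits on top of an offset of 0x10000, which is where the U+10FFFF ceiling comes from.

```rust
/// Number of bits needed to represent `value` (0 needs 0 bits).
/// Hypothetical helper, for illustration only.
pub fn bits_needed(value: u32) -> u32 {
    u32::BITS - value.leading_zeros()
}

/// The UTF-16 code units for a single code point (one or two units).
/// Hypothetical helper, for illustration only.
pub fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}
```

Calling `bits_needed(char::MAX as u32)` shows how many bits the largest code point needs, and `utf16_units(char::MAX)` shows the highest possible surrogate pair.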
|
Conveniently, though, this means that UTF-8 bytes 0xF8 through 0xFF are always nonsense, so the third-party Rust type `ColdString` uses a leading byte of 0xF8 through 0xFF in its 8-byte representation to indicate "I am an inline UTF-8 string, but the UTF-8 starts in the next byte and has a total length of N bytes", where N = byte - 0xF8.
This leaves the continuation marker bits alone, so `ColdString` can use those in that front byte to indicate "I am not actually inline data, I'm a pointer: rotate me so these indicator bits are my LSBs and zero them out to make me a 4-byte-aligned pointer".
That leaves all other 8-byte values for the valid UTF-8 strings, which all begin with either ASCII or a byte between 0xC2 and 0xF4 inclusive.
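As a rough sketch of that first-byte partition (this is not `ColdString`'s actual code; the `Tag` enum and `classify` function are hypothetical names, and the pointer rotation step is left out because its exact details are not given here):

```rust
/// What the first byte of the 8-byte value says about the rest of it.
/// Hypothetical type for illustration; not ColdString's real API.
#[derive(Debug, PartialEq, Eq)]
pub enum Tag {
    /// 0xF8..=0xFF: inline string of `len` bytes, starting at the next byte.
    InlineWithLength { len: u8 },
    /// 0x80..=0xBF: a UTF-8 continuation byte can never start a string,
    /// so it is free to mean "the value is a pointer".
    Pointer,
    /// ASCII or 0xC2..=0xF4: all 8 bytes are themselves the UTF-8 text.
    InlineFull,
    /// 0xC0, 0xC1, 0xF5..=0xF7: never a valid UTF-8 lead byte,
    /// and not assigned a meaning in this sketch.
    Unused,
}

/// Classify a value by its first byte, following the scheme described above.
pub fn classify(first: u8) -> Tag {
    match first {
        0x00..=0x7F | 0xC2..=0xF4 => Tag::InlineFull,
        0x80..=0xBF => Tag::Pointer,
        0xF8..=0xFF => Tag::InlineWithLength { len: first - 0xF8 },
        _ => Tag::Unused,
    }
}
```

The length tag covers 0 through 7 bytes, which is exactly what fits after the tag byte in an 8-byte value; a full 8-byte string needs no tag because its first byte is already a valid UTF-8 lead byte.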