Hacker News
by bombela 33 days ago
In summary, Unicode code points (characters) are 32 bit. JavaScript manipulates Unicode in UTF-16 for historical reasons, because at some point before Unicode, 16 bit was deemed enough (UCS-2). UTF-16 run length encodes 32-bit Unicode code points into one or two code units. Splitting in the middle of a code point produces one invalid half string and one semantically different half string.

Emoji are often a sequence of Unicode code points producing a single grapheme. Splitting in the middle of a grapheme will produce two valid strings, but with some funky half-baked emoji. So for a text editor it makes sense to split only at grapheme boundaries.
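A rough sketch of the grapheme point, in Python purely for illustration. Python slices strings by code point, whereas JavaScript slices by UTF-16 code unit, so the indices would differ there. The family emoji and the variable names are just an example chosen for this sketch.

```python
# A "family" emoji: three emoji glued together with ZERO WIDTH JOINER (U+200D).
ZWJ = "\u200d"
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

# One grapheme on screen (where the font supports it), but five code points.
code_points = [f"U+{ord(ch):04X}" for ch in family]

# Python slices by code point, so this cut lands between code points,
# yet still in the middle of the grapheme.
left, right = family[:2], family[2:]

# Both halves are valid strings and encode cleanly; they just no longer
# render as the original single emoji.
left_utf8 = left.encode("utf-8")
right_utf8 = right.encode("utf-8")
```

Finding real grapheme boundaries needs the segmentation rules from Unicode, which the Python standard library does not provide, so this only shows the problem and not the fix.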


> Unicode code points are 32 bit

21-bit, actually. It was supposed to be 32-bit, but UTF-16 caps out at 21-bit, so they lopped eleven bits of potential from Unicode (and UTF-8, so no more six-byte encoding).

> at some point before Unicode

No, in the early days of Unicode.

> run length encodes

Um… what? RLE is a data compression thing, UTF-16 has nothing to do with it.

> 21-bit, actually. It was supposed to be 32-bit, but UTF-16 caps out at 21-bit, so they lopped eleven bits of potential from Unicode (and UTF-8, so no more six-byte encoding).

Conveniently, though, this means that UTF-8 bytes 0xF8 through 0xFF are always nonsense, so the third-party Rust type `ColdString` uses leading bytes 0xF8 through 0xFF in its 8 bytes of representation to indicate "I am an inline UTF-8 string, but the UTF-8 starts in the next byte, with a total length of N bytes", where N = byte - 0xF8.

This leaves the continuation marker bits alone, so ColdString can use those in that front byte to indicate "I am not actually inline data, I'm a pointer; rotate me so these indicator bits are my LSBs and zero them out to make me a 4-byte-aligned pointer".

Which leaves all other 8-byte values for the valid UTF-8 strings, which all begin with either ASCII or a byte between 0xC2 and 0xF4 inclusive.
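The byte ranges above can be checked by brute force. This is only a sketch that surveys well-formed UTF-8 using Python's built-in encoder; it says nothing about how `ColdString` is actually implemented, and the variable names are made up for the example.

```python
# Survey every byte value that can appear in well-formed UTF-8 by encoding
# all Unicode scalar values (code points minus the surrogate range).
SURROGATES = range(0xD800, 0xE000)

lead_bytes = set()  # first byte of each encoded scalar value
all_bytes = set()   # every byte seen at any position

for cp in range(0x110000):
    if cp in SURROGATES:
        continue  # surrogates cannot be encoded in strict UTF-8
    encoded = chr(cp).encode("utf-8")
    lead_bytes.add(encoded[0])
    all_bytes.update(encoded)

# Byte values that never occur anywhere in well-formed UTF-8.
never_used = sorted(set(range(256)) - all_bytes)
```

The leading bytes come out as ASCII plus 0xC2 through 0xF4, and 0xF8 through 0xFF fall inside the never-used set, which is what makes them available as tags.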

>> Unicode code points are 32 bit

> 21-bit, actually

Less than that. https://en.wikipedia.org/wiki/Code_point#In_character_encodi...:

“The Unicode code space is divided into seventeen planes (the basic multilingual plane, and 16 supplementary planes), each with 65,536 (= 2¹⁶) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112”

That makes it log(1,114,112)/log(2) bits, which is about 20.09.
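For anyone who wants to redo the arithmetic, a small sketch (the constant names are made up for the example):

```python
import math

PLANES = 17
CODE_POINTS_PER_PLANE = 1 << 16              # 65,536

code_space = PLANES * CODE_POINTS_PER_PLANE  # 1,114,112 code points
max_code_point = code_space - 1              # 0x10FFFF

# Information-theoretic size of the code space, in bits (fractional).
fractional_bits = math.log2(code_space)

# Whole bits needed to store any single code point as an integer.
whole_bits = max_code_point.bit_length()
```

So both readings hold: the code space carries a little over 20 bits of information, but a fixed-width integer still needs 21 bits to hold the largest code point.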

(https://www.unicode.org/versions/Unicode17.0.0/ assigns 159,801 of them to characters)

Don't know why you are being downvoted (or my grandparent comment, for that matter). You are very correct.
Sorry, I was thinking of 0x1FFFFF as the end, but it’s 0x10FFFF. Forgetful.
If you are going to be pedantic, go all the way. 2^21 is 0 to 2_097_151. Unicode codepoint range is 0 to 1_114_111, slightly more than 2^20 (0 to 1_048_575).

I would argue that Unicode v2 onward, circa 1991 (the Unicode Consortium and ISO/IEC working together), is what anybody knows as Unicode, with the 0 to 1_114_111 code points easily manipulated as a 32-bit value.

I meant variable-length encoding; RLE does indeed encode a number of successive repetitions.
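A minimal sketch of that variable-length scheme, assuming the standard surrogate-pair arithmetic. The function names are invented for this example, and the result is cross-checked against Python's own UTF-16 encoder.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Encode one Unicode scalar value as one or two UTF-16 code units."""
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x10000:
        return [cp]                     # BMP: a single code unit
    offset = cp - 0x10000               # 20 bits left to distribute
    high = 0xD800 + (offset >> 10)      # high surrogate carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # low surrogate carries the bottom 10 bits
    return [high, low]


def builtin_code_units(ch: str) -> list[int]:
    """Same result via Python's encoder, used here as a cross-check."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


units = utf16_code_units(0x1F600)       # U+1F600 GRINNING FACE

# Cutting between the two code units leaves a lone surrogate, which strict
# encoders reject. This is the "invalid half string" case.
try:
    chr(units[0]).encode("utf-8")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True
```

The 20 bits split across the two surrogates, plus the 0x10000 offset, are also why the code space stops at 0x10FFFF.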