| HN Mirror



	by jcranmer 294 days ago
	> It's how the code point address space is defined.

	Not really. Unicode is still fundamentally based on code points, which go from 0 up to (but not including) 2^16 + 2^20, and all of the algorithms for Unicode properties operate on these code points. It's just that Unicode has left open a gap of code points so that the upper 2^20 code points can be encoded in UTF-16 without risk of confusion with other UCS-2 text.
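The arithmetic in this comment can be checked with a short Python sketch. The constant names are made up for illustration; only the numbers come from the comment.

```python
# Hypothetical names; the sizes are the ones quoted in the comment above.
BMP_SIZE = 2**16            # Basic Multilingual Plane, U+0000..U+FFFF
SUPPLEMENTARY_SIZE = 2**20  # the 16 planes above the BMP

code_space_size = BMP_SIZE + SUPPLEMENTARY_SIZE
max_code_point = code_space_size - 1

print(hex(code_space_size))      # 0x110000
print(hex(max_code_point))       # 0x10ffff
print(code_space_size // 2**16)  # 17 planes of 65,536 code points each
```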


jeberle 293 days ago

You forgot `- 2^11` for the surrogates. Gee, why isn't Unicode 2^21 code points? To understand the Unicode code point space you must understand UTF-16. The code space is defined by how UTF-16 works. That was my initial point.
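A rough sketch of the UTF-16 mechanism this comment refers to, assuming the standard surrogate ranges U+D800..U+DBFF (high) and U+DC00..U+DFFF (low). The helper name `to_surrogate_pair` is hypothetical. It shows where both the `2^11` reserved code points and the `2^20` supplementary code points come from.

```python
HIGH_START, LOW_START = 0xD800, 0xDC00

def to_surrogate_pair(cp):
    """Split a code point above the BMP into (high, low) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above the BMP use surrogate pairs")
    offset = cp - 0x10000                # fits in 20 bits
    high = HIGH_START + (offset >> 10)   # top 10 bits
    low = LOW_START + (offset & 0x3FF)   # bottom 10 bits
    return high, low

surrogate_count = 0xDFFF - 0xD800 + 1    # 2048 == 2**11 reserved code points
pair_count = 1024 * 1024                 # 2**20 reachable (high, low) pairs

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))               # 0xd83d 0xde00

# Cross-check against Python's built-in UTF-16 encoder.
encoded = chr(0x1F600).encode("utf-16-be")
print(encoded.hex())                     # d83dde00
```

The largest pair, (0xDBFF, 0xDFFF), decodes to U+10FFFF, which is why the code space ends there rather than at 2^21 - 1.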


jcranmer 293 days ago

If you're going to count the surrogates as not-a-Unicode-codepoint, you should also count the other noncharacters: the last two code points on each of the 17 planes and the range U+FDD0..U+FDEF.
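The noncharacters listed in this comment can be enumerated directly. This is a minimal sketch; the function name `is_noncharacter` is hypothetical, and the ranges are the ones the comment names.

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    in_fdd0_block = 0xFDD0 <= cp <= 0xFDEF
    # Low 16 bits equal to 0xFFFE or 0xFFFF, in any of the 17 planes.
    last_two_of_plane = (cp & 0xFFFE) == 0xFFFE
    return in_fdd0_block or last_two_of_plane

noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 66 == 32 + 17 * 2
```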

The expansion of Unicode beyond the BMP was designed to facilitate an upgrade compatibility path from UCS-2 systems, but it is extremely incorrect to somehow equate Unicode with UTF-16.


kbolino 293 days ago

FWIW, there is an official term for "code points excluding surrogates": "Unicode scalar value".
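A small sketch of that definition. The helper names are hypothetical. It also shows that Python's UTF-8 encoder accepts scalar values but rejects a lone surrogate.

```python
def is_scalar_value(cp):
    """A code point in U+0000..U+10FFFF that is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

scalar_count = sum(1 for cp in range(0x110000) if is_scalar_value(cp))
print(scalar_count)  # 1112064 == 2**16 + 2**20 - 2**11

def utf8_encodable(cp):
    """Python's strict UTF-8 encoder refuses surrogate code points."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

print(utf8_encodable(0x1F600))  # True
print(utf8_encodable(0xD800))   # False
```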


jeberle 293 days ago

OK, I'm lost here. Why is there a 1:1 correspondence between the two?
