by ivanbakel 2066 days ago
As well as this being rather closed-minded, it's also not true. The contents of the 0000-FFFE codepoints are public knowledge, and the biggest users of space are:

  1. the private use area
  2. the general "CJK" area
The second of which has a truly mind-boggling number of characters, including every possible composite Hangul glyph used in modern Korean, despite them being constructable from the basic Hangul codepoints.

Emojis and other symbols which aren't used for language appear relatively rarely. Certainly there is no reason to believe that UCS-2 would be sufficient for writing if they were removed. The number of scripts included in Unicode would exhaust even the private use area, and UTF-16 would have been invented regardless.
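
To make the UCS-2 versus UTF-16 point concrete, here is a minimal Python sketch of how UTF-16 represents a code point above FFFF as a surrogate pair, something UCS-2 cannot do. The helper name `to_surrogates` is made up for illustration; the arithmetic is the standard UTF-16 surrogate computation.

```python
def to_surrogates(code_point):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair.

    Hypothetical helper for illustration; only valid for U+10000..U+10FFFF.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point fits in a single 16-bit unit")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# U+20000 is the first character of CJK Unified Ideographs Extension B,
# which lies outside the 16-bit range.
high, low = to_surrogates(0x20000)
print(f"{high:04X} {low:04X}")
```

Python's own encoder can be used as a cross-check: `chr(0x20000).encode("utf-16-be")` should yield the same two 16-bit units.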

2 comments

> [...] despite them being constructable from the basic Hangul codepoints.

Unicode strives for round-trip compatibility with source character sets, and in this case KS X 1001 (KS C 5601 at that time) is the main culprit: it had 2,350 (out of 11,172) common syllables precomposed. But Korea also had supplementary character sets beyond KS X 1001, which were subsequently added in Unicode 1.1 (up to some 6,000 characters), before it was decided that having an algorithmically derived block of all 11,172 syllables was better. This whole situation is now known as the "Hangul mess".
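
As a rough illustration of what "algorithmically derived" means here, the following Python sketch maps between precomposed syllables and conjoining jamo using the arithmetic from the Unicode Standard (19 leading consonants, 21 vowels, 28 trailing slots including "none"). The function names `compose` and `decompose` are illustrative, not from any library.

```python
# Constants from the Unicode Standard's Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # T_COUNT includes "no trailing consonant"
S_COUNT = L_COUNT * V_COUNT * T_COUNT    # 11,172 syllables


def compose(l_index, v_index, t_index=0):
    """Return the precomposed syllable for the given jamo indices."""
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


def decompose(syllable):
    """Return the conjoining jamo sequence for a precomposed syllable."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l_index, rest = divmod(s_index, V_COUNT * T_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    jamo = chr(L_BASE + l_index) + chr(V_BASE + v_index)
    if t_index:                          # index 0 means no trailing consonant
        jamo += chr(T_BASE + t_index)
    return jamo


print(S_COUNT)
print([f"U+{ord(c):04X}" for c in decompose("한")])
```

The result of `decompose` should agree with `unicodedata.normalize("NFD", ...)` from the standard library, which applies the same decomposition.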

>The second of which has a truly mind-boggling number of characters, including every possible composite Hangul glyph used in modern Korean, despite them being constructable from the basic Hangul codepoints.

Also true of most Chinese characters, but the proposal to encode them component-wise was a no-go (for adoption in China, IIRC) and separate character encodings were chosen in the end. I never managed to dig up the reasons behind it.

What was adopted was a mapping of the existing encodings, as per the round-trip convertibility policy. If Hangul had had a working composable encoding, it would've been used instead.
The problem is that Hangul had too many composable encodings, with each vendor using its own. As a result the government went with yet another standard (KS X 1001) that fit better into the ISO/IEC 2022 infrastructure. It was too late by the time a standardized composable encoding was specified as an annex to the original standard in 1992: Windows 95 ignored the annex and introduced its own extension to KS X 1001, now known as code page 949 and standardized in the WHATWG Encoding Standard [1].

[1] https://encoding.spec.whatwg.org/#index-euc-kr

Yes, and that's how Koreans got themselves a block up in 0x11xx for composable jamo, which means 3 bytes per jamo in UTF-8, and up to 9 bytes per syllable :-O
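
The byte counts are easy to check in Python. This is a small sketch that compares one syllable in its precomposed form and in its decomposed (conjoining jamo) form; `encoded_sizes` is a made-up helper name, while `unicodedata.normalize` is from the standard library.

```python
import unicodedata


def encoded_sizes(text):
    """Return the encoded length in bytes of text in UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),  # big-endian, no BOM
    }


precomposed = "한"                                     # single code point U+D55C
decomposed = unicodedata.normalize("NFD", precomposed)  # three jamo in U+11xx

print(len(precomposed), encoded_sizes(precomposed))
print(len(decomposed), encoded_sizes(decomposed))
```

A syllable without a trailing consonant, such as "가", decomposes into only two jamo, so 9 bytes is the upper bound per syllable in UTF-8 rather than a fixed cost.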