by AdamH12113 1375 days ago
> This version adds 4,489 characters, bringing the total to 149,186 characters. These additions include two new scripts, for a total of 161 scripts, along with 20 new emoji characters, and 4,193 CJK (Chinese, Japanese, and Korean) ideographs.

That seems like a lot of new CJK characters! How did they end up with so many new characters after so long? Is there some gradual process of adding historical or extremely rare characters, or were some deliberately left out of earlier versions?

3 comments

More like the former. Earlier versions of the standard did deliberately leave characters out, through a policy called Han unification [1], but that policy has been toned down considerably thanks to the expansion of the Unicode code point space in 2.0, the disunification processes that followed, and the eventual introduction of the Ideographic Variation Database [2] to handle the remaining cases.

[1] https://en.wikipedia.org/wiki/Han_unification

[2] https://unicode.org/ivd/
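
To make the variation database idea concrete, here is a rough sketch of what a variation sequence looks like in practice: an ordinary unified ideograph followed by a variation selector from the range U+E0100..U+E01EF. The helper name `split_ivs` is made up for illustration, and pairing U+8FBB with VS17 is only an example; whether a given sequence means anything depends on what is registered in the IVD collections.

```python
import unicodedata

# Ideographic variation selectors occupy U+E0100..U+E01EF (VS17..VS256).
IVS_FIRST, IVS_LAST = 0xE0100, 0xE01EF


def split_ivs(text):
    """Pair each character with the variation selector that follows it, if any.

    Hypothetical helper for illustration; it does not check the IVD itself.
    """
    pairs = []
    for ch in text:
        if pairs and IVS_FIRST <= ord(ch) <= IVS_LAST:
            # Attach the selector to the preceding base character.
            pairs[-1] = (pairs[-1][0], ch)
        else:
            pairs.append((ch, None))
    return pairs


base = "\u8fbb"                   # a unified ideograph (example choice)
variant = base + "\U000E0100"     # same code point plus VS17
pairs = split_ivs(variant + base)

for ch, selector in pairs:
    label = unicodedata.name(selector) if selector else "no selector"
    print(f"U+{ord(ch):04X}", label)
```

The base code point stays the same in both cases, so plain-text search and comparison can ignore the selector, while a font that supports the sequence can pick the specific glyph.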

On the Wikipedia page for "CJK Unified Ideographs Extension H", under "History" (https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extensi...), you can find dozens of linked documents describing why someone thought the characters should be added.

One random example I opened (https://www.unicode.org/L2/L2017/17099-haifeng-county-uax45....) is a 9-page PDF proposing a single character used for "congee shop signs in Haifeng County".

I continue to think that Han unification, or Unicode for CJK, is not the best solution to the problem.