by Piskvorrr 366 days ago
Once you start taxatively naming "these are the Only Blessed Ranges," you'll be bitten by the usual brouhaha "email address ends with .[a-z]{2,3}". We all know how it went, and ".[a-z]{2,4}" didn't cut it, either, not even in 2000.
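A quick sketch of how that goes wrong, in Python. The two patterns are my own anchored, dot-escaped versions of the ones quoted above, and the addresses are made up; `.info` and `.museum` are real TLDs that such allowlists reject.

```python
import re

# Hypothetical patterns modelled on the quoted ones, with the dot escaped
# and the match anchored at the end of the address.
TLD_2_3 = re.compile(r"\.[a-z]{2,3}$")
TLD_2_4 = re.compile(r"\.[a-z]{2,4}$")

# Made-up addresses for illustration only.
ADDRESSES = [
    "alice@example.com",     # 3-letter TLD
    "bob@example.info",      # 4-letter TLD
    "carol@example.museum",  # 6-letter TLD
]


def accepted(pattern, address):
    """Return True if the address ends with a TLD the pattern allows."""
    return pattern.search(address) is not None


for address in ADDRESSES:
    print(address, accepted(TLD_2_3, address), accepted(TLD_2_4, address))
```

Each widening of the range only moves the failure to the next longer TLD, which is the same trap as hard-coding a list of blessed character ranges.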

To add to the complexity, not all Chinese characters in use for names are representable in unicode. Perhaps at some point legal institutions will just have to define the list of characters that people may have in their names on official documents. This reminds me of that 'what programmers believe about names' article from a while back.
If so, I think they would just need to be added to Unicode. Do you have an estimate of how many are missing?
As an interested bystander, I estimate it to be on the order of 10⁵. Email Ken Lunde for better insights.

Note that GP claimed "not representable" (not "not represented"). Based on what I know, that claim feels quite wrong.

> not all Chinese characters in use for names are representable in unicode

Why? How do you come to this conclusion?

Han unification[1] prevents the representation of all Chinese characters. There are multiple languages that use Chinese characters, but they don't all use the same characters. Unicode decided to only use Han Chinese characters, so names using other sorts of Chinese characters can't be written with Unicode. The Han "equivalent" characters can be used, but that looks weird.

Think of it as though Unicode decided that the letter "m" wasn't needed to write English text, since you can just write "rn" and it'll be close enough. Someone named "James" might want to have their name spelled correctly instead of "Jarnes", but that wouldn't be possible. Han unification did essentially this.

[1] https://en.wikipedia.org/wiki/Han_unification

I feel it's unlikely that this is the explanation for what GGP had in mind. I postulate that name characters usually have no variants and thus do not undergo unification, or, where there are variants, they are already encoded as Z variants, so the contention is also moot.

Prove me wrong with a counter-example.

𫟈 is U+2B7C8 "CJK Unified Ideograph-2B7C8". 𛁻 is U+1B07B "Hentaigana Letter To-5".

Both characters fall into the first category I mentioned, no variants.
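
For anyone who wants to poke at those two code points, here is a minimal Python sketch. The helper name `describe` is mine; it only uses the code point numbers given above and checks which plane each one sits in and how many code units it takes, without relying on any particular font or Unicode database version.

```python
def describe(code_point):
    """Return (plane, UTF-8 byte count, UTF-16 code unit count) for a code point."""
    ch = chr(code_point)
    plane = code_point >> 16                      # plane 0 is the BMP
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return plane, utf8_bytes, utf16_units


# The two code points cited above.
for cp in (0x2B7C8, 0x1B07B):
    plane, utf8_bytes, utf16_units = describe(cp)
    print(f"U+{cp:04X}: plane {plane}, {utf8_bytes} UTF-8 bytes, {utf16_units} UTF-16 units")
```

Both sit outside the BMP, so software that assumes one UTF-16 unit per character will mangle them even though they are encoded.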