by contingencies 4641 days ago
Mainland China and Taiwan did something similar. I actually went to one of the newly-post-Unicode meetings of Academia Sinica in Taipei, where hardcore ancient Chinese academics and computational linguists were discussing unsolved conversion issues for some of those creatures.

How often does UTF-8 update to account for these issues?
I think people who need to write characters older than a certain era basically tend to avoid Unicode and just use images, and CJK Unicode is essentially fixed, even if it is still changing slowly now. More info here: http://www.unicode.org/reports/tr38/

My overall impression was that super ancient characters (of which there are tens of thousands more, probably with many academic arguments as to their individual distinctions or similarities) have been left out of Unicode proper and are under some documentation/standardization effort by a separate group as a 'special use region' mapping within Unicode for their own use by agreement. I can't find their site, though I could swear I had it a few years back.

Initially, "Han unification" was an effort by the authors of Unicode and the Universal Character Set to map multiple character sets of the so-called CJK languages (Chinese/Japanese/Korean) into a single set of unified characters, and it was completed for the purposes of Unicode in 1991 (Unicode 1.0). Unfortunately, not only did they try to get modern scholars to agree on a normalized set of characters, but they also wanted them to agree on semantic equivalence (and a subset of pronunciations!)... in all cases... across all time: obviously not a good way to please hardcore academics.

(It should be noted that Vietnamese also used Chinese characters; that in ancient history (even ~3000+ years ago) numerous non-Chinese ideographic/logographic/alphabetic scripts existed in the south-western Chinese borderland; and that in modern Chinese, certain surviving characters today seen as 'Cantonese' (from the southern Chinese coast to the east of Vietnam) are a surviving relic of this arguably greater and predominantly Southern Chinese culture of ideogrammatic innovation.) Some of the scripts still survive archaeologically, others survive in literary reference, and some (though primarily alphabetic, save for the Naxi Dongba script, an understanding of which is now critically endangered, if not lost, despite government efforts at preservation) are still alive today... often with government reforms or some 19th century debris of Jesuit or other religious meddling.

That's fascinating!

I can only imagine both the pressure and the pushback to 'get it right' from academics. Not only for their own language, but in a competitive sense with other countries.

Great reply!

There's not much pressure. I don't think anyone reasonably expects the Unicode Consortium to fix the problems with Han unification at all, mainly because Unicode is already too pervasive and the cost of updating existing technology to make it compatible would be too high.

Specialized software, encodings and fonts for writing traditional characters, such as Mojikyo (http://www.mojikyo.org/PWU8N/index.php), can be downloaded, but any text written with them is basically incompatible with other software, including the web, unless it is converted to images.

It's unlikely this will change. Technological progress is more important to most people, and anything not in Unicode will eventually be lost in time, just as spoken languages disappear every year when state education systems force a standardized language on people.

Mojikyo! That's the one. Their site has changed a lot since I last saw it (probably ~5 years back).

On the disappearance meme, I would posit that "writing almost never dies out anymore, it just gets progressively more obscure".

On the spoken languages meme, I also volunteer on occasion for the World Oral Literature Project (Cambridge/Yale) at http://oralliterature.org/ and I've been contemplating heading up to Assam in India, maybe Bhutan and Nepal, to do what little recording I can manage.

UTF-8 encodes Unicode code points, so it's Unicode itself, or some external entity that converts between character sets, that has to deal with those issues, not UTF-8.
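
To make that concrete, here is a minimal sketch of a UTF-8 encoder in Python. The function name `utf8_encode` is my own for illustration; it assumes the byte layout from RFC 3629 and only checks that the number is in range and not a surrogate. Nothing in it depends on which character (if any) Unicode has assigned to the number.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    # Only the numeric range matters here, not what the code point means.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# The sketch agrees with Python's built-in encoder on a few sample code points.
for cp in (0x41, 0xE9, 0x4E2D, 0x20000, 0x10FFFF):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

So a newly added Han character just becomes another number that the same bit-shuffling already handles.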

UTF-8 would pretty much only need to be updated if the unicode standard redefines what a code point is (e.g. starts using floating point, decimals, imaginary numbers or something else that is also unlikely to happen)

> UTF-8 would pretty much only need to be updated if the unicode standard redefines what a code point is (e.g. starts using floating point, decimals, imaginary numbers or something else that is also unlikely to happen)

Or if they decide that they need more codepoints, so some invalid-but-possible UTF-8 byte sequences suddenly become valid.

There's no reason for that: UTF-8 exists only to encode Unicode code points, and the whole range of code points (including the 80% not yet assigned) can be expressed in UTF-8.
UTF-8 has only been updated once (to remove 5- and 6-byte sequences, limiting it to the same range of values that UTF-16 can express).
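
A quick way to see this, sketched with Python's built-in decoder (the helper name `is_valid_utf8` is just for illustration): the highest code point, U+10FFFF, fits in 4 bytes, while an old-style 5-byte sequence and a 4-byte sequence that would decode to something above U+10FFFF are both rejected.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The highest code point still fits in a 4-byte sequence.
highest = chr(0x10FFFF).encode("utf-8")
assert highest == b"\xf4\x8f\xbf\xbf"
assert is_valid_utf8(highest)

# A 5-byte sequence from the original design is no longer valid UTF-8.
assert not is_valid_utf8(b"\xf8\x88\x80\x80\x80")

# A 4-byte sequence that would decode to U+110000 is out of range too.
assert not is_valid_utf8(b"\xf4\x90\x80\x80")
```

These are exactly the "invalid-but-possible" byte sequences mentioned above: the bit pattern exists, but decoders refuse it.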

New versions of Unicode are standardized every year or two.