|
|
|
|
|
by jcranmer
2387 days ago
|
|
Arabic numerals (0123456789) are not to be confused with the Arabic numerals (٠١٢٣٤٥٦٧٨٩). So the fact that kanji literally means "Chinese character" doesn't mean that kanji and hanzi should be considered the same script. The Latin script (that which I write right now) and the Cyrillic script both derived heavily from the Greek script, especially the capital letters--fully 60% of them are identical in Latin and Greek, even more if you include obsolete letters like digamma and lunate sigma (roughly F and C, respectively). Most of these homoglyphs furthermore share identical phonetic values. In retrospect, treating traditional Chinese, simplified Chinese, and Japanese kanji as different scripts seems like it would have been the better path. I don't know enough about the Korean and Vietnamese usage of Chinese characters to know if those scripts are themselves independent daughter scripts or complete imports of Chinese with a few extra things thrown in (consider Farsi's additions to Arabic, or Icelandic's þ and ð additions to Latin). |
|
The Japanese writing system differs from the Chinese one by having its own distincts scripts (hiragana, katakana), but most of its subset made of Chinese characters (kanji) is the same than the Chinese script. The most comprehensive Chinese character dictionary is a Japanese one (Daikanwa jiten), which give definition and Japanese readings and this is possible precisely because the script is the same.
The only differences are characters created for use in Japan (kokuji) which can be treated as an extension like the Vietnamese Nôm, characters simplified by the Japanese government (some jôyô kanji) and variation in some glyph's shape (黃/黄). So, treating the full inventory of these languages as different scripts wouldn't make more sense than encoding the English, French and Czech alphabets separately because few characters differ.
My opinion is that Han unification makes sense, but the mistake made was to encode the variant interpretation at application level (e.g. html lang tags), which is not portable. I don't know how Unicode variant form works in details (putting a trailing code to a character to indicate precisely which variant is meant) but something like that at text encoding level could ease a lot of pain.