Hacker News
by orbitur 3498 days ago
This is something that's been bugging me for years.

Why are there multiple representations of alphabet characters in Unicode? It seems reasonable to include accent marks, but what's the benefit in having a Cyrillic 'o' alongside a standard 'o' or the 2 or 3 other ASCII-lookalike sets of characters?

6 comments

The most important reason is semantics. If "O" and "0" look alike in a certain font, should we use the same character code for both? No, because they have different meanings.

Here are some contexts in which this semantic difference is important: search (compare search results for "cop" and "сор"), alphabetical sorting, text-to-speech, spellchecking, case conversion ("ATOM" -> "atom", but "АТОМ" -> "атом", note the difference between t-т and m-м).
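To make the "cop" vs "сор" and "ATOM" vs "АТОМ" examples concrete, here is a small Python sketch using only the standard library. The variable and function names are mine, not from the thread, and the Cyrillic strings are written as escapes so it is obvious which is which.

```python
import unicodedata

latin_cop = "cop"                    # Latin c, o, p
cyrillic_sor = "\u0441\u043e\u0440"  # Cyrillic es, o, er (looks like "cop")

def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for label, text in (("latin", latin_cop), ("cyrillic", cyrillic_sor)):
    for code_point, name in describe(text):
        print(label, code_point, name)

# The strings render alike but never compare equal, so search and sorting differ.
print(latin_cop == cyrillic_sor)

# Case conversion follows the script of each letter.
latin_atom = "ATOM"
cyrillic_atom = "\u0410\u0422\u041e\u041c"  # Cyrillic A, TE, O, EM
print(describe(latin_atom.lower()))
print(describe(cyrillic_atom.lower()))
```

If the lookalikes shared code points, `lower()` would have no way of knowing whether "T" should become "t" or "т".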

There will never be agreement on what the set of distinct characters is (or on which characters should be included: a Bitcoin logo? a Facebook logo?). I see Unicode as a necessary evil. Due to its complexity, most applications should treat Unicode text as a black box.

I never rely on Unicode for computation. When receiving Unicode I always make sure it's in the ASCII range. It could be argued that there should never have been Unicode domain names but I guess Western people are very lucky that ASCII includes most of their characters...

> When receiving Unicode I always make sure it's in the ASCII range. [...] Western people are very lucky that ASCII includes most of their characters...

Please don't spread the myth that Western languages are encodable in ASCII, and don't pretend to support Unicode (or anything other than English) if you filter everything down to ASCII.

The _only_ Western language that is encodable in ASCII is English.

Corollary: English is the only language that can be encoded in ASCII.

The other Western languages (e.g. French, Spanish, Portuguese, German...) have endless issues with text being encoded in, or stripped down to, ASCII.
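A rough illustration of what "stripped down to ASCII" does in practice. The helper `strip_to_ascii` is hypothetical, just one common way people do this in Python (decompose, then drop whatever is left outside ASCII); the sample words are my own.

```python
import unicodedata

def strip_to_ascii(text):
    """Hypothetical filter: split off accents, then drop all non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

samples = [
    "Straße",   # German: ß has no decomposition, so it simply vanishes
    "année",    # French: the accent is lost
    "señor",    # Spanish: ñ and n are different letters
    "coração",  # Portuguese: both ç and ã are flattened
]

for word in samples:
    # ascii() keeps the output printable even on an ASCII-only terminal.
    print(f"{ascii(word)} -> {strip_to_ascii(word)!r}")

# A plain range check, for comparison: only the stripped forms pass it.
print([word.isascii() for word in samples])
```

The result is sometimes still readable and sometimes a different or nonexistent word, which is the kind of issue the comment is pointing at.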

As a German I can attest that I can converse perfectly well (e.g. by email) in ASCII, although it's convenient to use umlauts, which I do. And I also agree that French or Spanish might be less convenient.

But that was not my point. The point was about identifiers, such as DNS names.
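For the identifier case, a minimal sketch of how the lookalikes behave as DNS names, using Python's built-in `idna` codec (which implements the older IDNA 2003 rules). The hostnames and the helper name `to_dns_form` are made up for illustration.

```python
def to_dns_form(hostname):
    """Encode a hostname into the ASCII form that goes on the wire."""
    return hostname.encode("idna")

latin_host = "cop.example"
cyrillic_host = "\u0441\u043e\u0440.example"  # Cyrillic es, o, er

# Pure-ASCII names pass through unchanged; the Cyrillic one becomes an
# "xn--" label, so the two names are distinct to DNS even if they render alike.
print(to_dns_form(latin_host))
print(to_dns_form(cyrillic_host))
```

So even with Unicode domain names, what DNS itself sees is still ASCII; the confusion lives in what gets displayed to the user.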

One goal of Unicode has been lossless round-tripping to and from legacy encodings (to encourage adoption). If such an encoding contains both Latin and Cyrillic letters, Unicode must keep them separate to enable that.

The (seemingly obstinate) answer is that they are different characters. The Russian "Н" sounds like an "N" in English.

If you're transcribing a conversation at the UN and there is a mix of different languages, the fact that "Het" is transcribed in Latin characters is information. Het may be a South American group of people, or it could just be a Russian dude saying "no", even if it looks the same.

I understand that we're still burdened by intralanguage homonyms, but I appreciate the fact that it isn't complicated further.
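The "Het" example can be checked mechanically. This sketch guesses the script of each letter from the first word of its Unicode name, which is a crude heuristic of mine rather than a proper script property lookup; `scripts_used` is a hypothetical helper.

```python
import unicodedata

def scripts_used(text):
    """Crude heuristic: take the first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

latin_het = "Het"                    # Latin H, e, t
cyrillic_net = "\u041d\u0435\u0442"  # Cyrillic EN, IE, TE (Russian "no")

print(scripts_used(latin_het))
print(scripts_used(cyrillic_net))

# Lowercasing also diverges: Latin "H" becomes "h", Cyrillic EN becomes U+043D.
print([f"U+{ord(ch):04X}" for ch in latin_het.lower()])
print([f"U+{ord(ch):04X}" for ch in cyrillic_net.lower()])
```

The script information survives in the text itself, so a transcript does not need out-of-band language tags to tell the two words apart.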

The font metrics and hinting/kerning are likely language- or dialect-specific.

Compatibility with ISO 8859. For example, the start of the Cyrillic block (U+04xx) follows the layout of ISO 8859-5.
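Both the round-tripping point and the ISO 8859-5 layout can be poked at with Python's `iso8859_5` codec. A sketch, with names of my own choosing: it checks which high bytes land at a fixed offset inside the Cyrillic block, and that a mixed Latin/Cyrillic string survives a round trip.

```python
OFFSET = 0x0360  # e.g. byte 0xB0 -> U+0410 (Cyrillic capital A)

def decode_byte(byte):
    """Decode a single ISO 8859-5 byte to its Unicode character."""
    return bytes([byte]).decode("iso8859_5")

# High bytes that do NOT map to byte + OFFSET (punctuation-like characters
# that live elsewhere in Unicode).
exceptions = [b for b in range(0xA0, 0x100)
              if ord(decode_byte(b)) != b + OFFSET]
print([hex(b) for b in exceptions])

# Latin "H" and Cyrillic EN are different bytes in the legacy encoding,
# so they need different code points for a lossless round trip.
mixed = "Het / \u041d\u0435\u0442"
encoded = mixed.encode("iso8859_5")
print(encoded)
print(encoded.decode("iso8859_5") == mixed)
```

Had Unicode unified the lookalike letters, the two distinct bytes for "H" and "Н" would decode to one character, and re-encoding could not tell which byte to write back.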