|
|
|
|
|
by rstuart4133
2389 days ago
|
|
I can possibly forgive them not coming up with UTF-8 first, and a few character specific mistakes. I can even forgive them for not coming up with UTF-8 at all, which is fact what happened because it was invented outside of the organisation and then common usage forced it onto them. What I can not forgive is coming up with something that allows the same string have multiple representations, which happens in UCS-2 / UTF-16, and because pre-composed codepoints overlap pre-existing characters. That mistake makes "string1" == "string2" meaningless in the general case. It's already caused exploits. I also can not forgive pre-composed codepoints for another reason: the make parsing strings error prone. This is because ("A" in "Åström") no longer works - it will return true in if Å is composite character. |
|
Let me explain again. We have these pesky diacritical marks. How shall they be represented? Well, if you go with combining codepoints, the moment you need to stack more than one diacritical on a base character then you have a normalization problem. Ok then! you say, let us go with precompositions only. But then you get other problems, such as that you can't decompose them so you can cleverly match 'á' when you search for 'a', or that you can't easily come up with new compositions the way one could with overstrike (in ASCII) or with combining codepoints in decomposed forms. Fine!, you say, I'll take that! But now consider Hangul, where the combinations are to form syllables that are themselves glyphs... the problems get much worse now. What might your answer be to that? That the rest of the world should abandon non-English scripts, that we should all romanize all non-English languages, or even better(?), abandon all non-English languages?
(And we haven't even gotten to case issues. You think there are no case issues _outside_ Unicode? That would be quite wrong. Just look at the Turkish dot-less i, or Spanish case folding for accented characters... For example, in French upcasing accented characters does not lose the accent, but in Spanish they do -or used to, or are allowed to be lost, and often are-, which means that accented characters don't round-trip in Spanish.)
No, I'm sorry, but the equivalence / normalization problem was unavoidable -- it did not happen because of "egos" on the Unicode Consortium, or because of "national egos", or politics, or anything.
The moment you accept that this is an unavoidable problem, life gets better. Now you can make sure that you have suitable Unicode implementations.
What? you think that normalization is expensive? But if you implement a form-insensitive string comparison, you'll see that the vast majority of the time you don't have to normalize at all, and for mostly-ASCII strings the performance of form-insensitivity can approach that of plain old C locale strcmp().