by tkot
37 days ago
> It's less pronounced with diacritics, but enter Unicode normal forms: you can represent š either as š, or s followed by a diacritic. When you want to compare two strings, you have to normalize them to ensure you are comparing apples to apples. I can guarantee most software is broken in that regard. For Cyrillic, it just works.

It's the same with the Unicode encoding of Cyrillic letters: й (U+0439) can also be written as й (и U+0438 + ◌̆ U+0306).

> Interestingly, and not many know this, Unicode includes separate codepoints for all of the digraphs too. While well-intentioned, it only makes the problem worse.

Based on your description, it seems that the root cause of the issues is using two letters to represent the digraph, for example N (U+004E) J (U+004A) instead of Ǌ (U+01CA), and the sorting issues would be identical if people typed Н (U+041D) Ь (U+042C) instead of Њ (U+040A). What's the reason for the digraph being substituted by two letters more often in the first case than in the second?
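To make the comparison concrete, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `same_text` is made up for illustration, and choosing NFC for the comparison is an assumption on my part; it is not what any particular piece of software does.

```python
import unicodedata

# Two encodings of the same Cyrillic letter й.
precomposed = "\u0439"        # single code point U+0439
decomposed = "\u0438\u0306"   # и (U+0438) + combining breve (U+0306)

def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

raw_equal = precomposed == decomposed                  # False: code points differ
normalized_equal = same_text(precomposed, decomposed)  # True: canonically equivalent

# The Latin digraph code point only has a *compatibility* decomposition,
# so NFC leaves it alone while NFKC turns it into the two letters N + J.
latin_digraph = "\u01CA"
digraph_nfc = unicodedata.normalize("NFC", latin_digraph)    # unchanged
digraph_nfkc = unicodedata.normalize("NFKC", latin_digraph)  # "NJ"

# The Cyrillic letter Њ (U+040A) has no decomposition at all, so no
# normalization form makes it equal to Н (U+041D) + Ь (U+042C).
cyrillic_letter = "\u040A"
cyrillic_pair = "\u041D\u042C"
cyrillic_match = (
    unicodedata.normalize("NFKC", cyrillic_letter)
    == unicodedata.normalize("NFKC", cyrillic_pair)
)  # False
```

So normalization settles the й case either way, but the digraph case depends on whether the software uses a compatibility form (NFKC/NFKD) or only a canonical one.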
|
Perhaps people misunderstood me: it is not a _technical_ property of the Cyrillic (vs. Latin) script per se, but the result of the historical setting combined with the freedom to adapt the script to a (smaller) group's language. This has led to Cyrillic scripts being _developed_ to be technically more suitable for Slavic languages, because where the Latin script was used, there was not as much liberty (perceived or real) to change it.
I mean, either one is just a set of symbols representing parts of spoken words, and obviously, if they had been developed similarly, there would be no difference between them. But for Slavic languages they were _not_ developed similarly, which is my point.