by tkot 37 days ago
> It's less pronounced with diacritics, but enter Unicode normal forms: you can represent š either as š, or s followed by a diacritic. When you want to compare two strings, you have to normalize them to ensure you are comparing apples to apples. I can guarantee most software is broken in that regard. For Cyrillic, it just works.

It's the same with Unicode encoding of Cyrillic letters - й (U+0439) can be written as й (и U+0438 + ◌̆ U+0306)
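A minimal Python sketch of this, using the standard `unicodedata` module. The helper name `same_text` is hypothetical, and comparing in NFC is an assumption (NFD would work just as well as long as both sides use the same form):

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (assumed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


latin_precomposed = "\u0161"          # š as a single code point
latin_decomposed = "s\u030C"          # s + COMBINING CARON
cyrillic_precomposed = "\u0439"       # й as a single code point
cyrillic_decomposed = "\u0438\u0306"  # и + COMBINING BREVE

# A naive comparison sees different code point sequences in both scripts.
latin_raw_equal = latin_precomposed == latin_decomposed
cyrillic_raw_equal = cyrillic_precomposed == cyrillic_decomposed

print(latin_raw_equal, cyrillic_raw_equal)                # False False
print(same_text(latin_precomposed, latin_decomposed))     # True
print(same_text(cyrillic_precomposed, cyrillic_decomposed))  # True
```

So the normalization problem is not specific to Latin: both scripts need the same treatment before comparison.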

> Interestingly — and not many know this — Unicode includes separate codepoints for all of the digraphs too. While well-intentioned, it only makes the problem worse.

Based on your description, it seems that the root cause of the issues is using two letters to represent the digraph - for example N (U+004E) J (U+004A) instead of Ǌ (U+01CA) - and the sorting issues would be identical if people typed Н (U+041D) Ь (U+042C) instead of Њ (U+040A).

What's the reason for the digraph being substituted by 2 letters in the first case more often than in the second case?
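One difference that can be checked directly, sketched in Python with the standard `unicodedata` module: the Latin digraph code point has a compatibility decomposition to two letters, while the Cyrillic letter has none. The variable names are only illustrative, and this shows what the normalization forms do, not why people type one or the other.

```python
import unicodedata

latin_digraph = "\u01CA"    # Ǌ, LATIN CAPITAL LETTER NJ
cyrillic_letter = "\u040A"  # Њ, CYRILLIC CAPITAL LETTER NJE

# Canonical normalization (NFC) keeps the Latin digraph as one code point.
latin_nfc = unicodedata.normalize("NFC", latin_digraph)

# Compatibility normalization (NFKC) splits it into the two letters N + J.
latin_nfkc = unicodedata.normalize("NFKC", latin_digraph)

# The Cyrillic letter has no decomposition, so even NFKD leaves it unchanged.
cyrillic_nfkd = unicodedata.normalize("NFKD", cyrillic_letter)

print(latin_nfc == latin_digraph)        # True
print(latin_nfkc == "NJ")                # True
print(cyrillic_nfkd == cyrillic_letter)  # True
```

So text that has passed through NFKC or NFKD anywhere in a pipeline loses the Latin digraph code point, whereas Њ survives every normalization form.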

1 comment

You are absolutely right that there are examples where Cyrillic as used by Slavic languages is not perfectly "clean" either, and it's certainly a lot more nuanced than my simplistic and absolutist claim.

Perhaps people misunderstood me: it is not a _technical_ property of Cyrillic (vs Latin) script per se, but a combination of historical setting and ability to adapt the script to the (smaller) group's language. This has led to Cyrillic scripts being _developed_ to be technically more suitable for Slavic languages, because where Latin script was used, there was not as much liberty (perceived or real).

I mean, either is just a set of pictograms representing parts of spoken words, and obviously, if developed similarly, there is no difference between them. But for Slavic languages they were _not_ developed similarly, which is my point.