Hacker News
by tkot 35 days ago
> it really is also a tool to best codify spoken language of the Slavs (in a sense, it is trivially provable that Cyrillic script is better adapted even to languages which do not use it today, but have to resort to digraphs or glyphs with diacritics — some are thus not using it to distance from a particular influence instead

I've heard this claim many times but never the reasoning behind it - by what metric is "ш" superior to "š" and so on?

It's less pronounced with diacritics, but enter Unicode normal forms: you can represent š either as š, or s followed by a diacritic. When you want to compare two strings, you have to normalize them to ensure you are comparing apples to apples. I can guarantee most software is broken in that regard. For Cyrillic, it just works.
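To make the comparison problem concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `same_text` is made up for illustration, and NFC is just one reasonable choice of normal form:

```python
import unicodedata

precomposed = "\u0161"   # š as a single code point (U+0161)
decomposed = "s\u030c"   # s followed by U+030C COMBINING CARON

def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normal form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == decomposed)            # False: different code point sequences
print(same_text(precomposed, decomposed))   # True: identical once normalized
```

Any code path that compares, hashes, or deduplicates strings without such a step treats the two spellings as different text.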

With digraphs (lj, nj, dž + sometimes dj for đ too), it's even worse. Even capitalization is ambiguous: sometimes it's Lj and other times it's LJ. Then you have words like konjugacija where nj is not a digraph.

Interestingly — and not many know this — Unicode includes separate codepoints for all of the digraphs too. While well-intentioned, it only makes the problem worse.
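A rough sketch of what those code points look like from Python, again with only the standard `unicodedata` module. The sample word and variable names are just for illustration. The dedicated digraph characters live at U+01C4 through U+01CC and only fold back to two letters under compatibility normalization (NFKC), not under NFC:

```python
import unicodedata

# The nine dedicated digraph code points: DŽ/Dž/dž, LJ/Lj/lj, NJ/Nj/nj.
DIGRAPH_CODE_POINTS = "\u01c4\u01c5\u01c6\u01c7\u01c8\u01c9\u01ca\u01cb\u01cc"

for ch in DIGRAPH_CODE_POINTS:
    two_letters = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {two_letters!r}")

single = "\u01c9ubav"   # "ljubav" spelled with the single code point U+01C9
typed = "ljubav"        # the same word as most people type it: l + j

nfc_equal = unicodedata.normalize("NFC", single) == unicodedata.normalize("NFC", typed)
nfkc_equal = unicodedata.normalize("NFKC", single) == unicodedata.normalize("NFKC", typed)
print(nfc_equal, nfkc_equal)   # False True

# Each digraph also has three case forms: lower, title, and upper.
lower = "\u01c9"
print(lower.title() == "\u01c8", lower.upper() == "\u01c7")   # True True
```

So the same word now has two more spellings that canonical normalization does not unify, which is the "makes the problem worse" part.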

Digraphs are especially sucky when you try sorting strings in a phonebook order as LJ comes after L, so you've got ...LI, LK..., LZ, LJA... With exceptions, it is even worse.
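Here is a rough sketch of what phonebook-order sorting ends up looking like, assuming Gaj's Latin alphabet order and a hand-maintained exception set. `letters`, `sort_key`, and `not_digraph_words` are hypothetical names for this example, not any library's API; real software would more likely rely on a proper collation library:

```python
import unicodedata

# Gaj's Latin alphabet in order; dž, lj and nj each count as one letter.
ALPHABET = [
    "a", "b", "c", "\u010d", "\u0107", "d", "d\u017e", "\u0111", "e", "f",
    "g", "h", "i", "j", "k", "l", "lj", "m", "n", "nj",
    "o", "p", "r", "s", "\u0161", "t", "u", "v", "z", "\u017e",
]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
DIGRAPHS = ("d\u017e", "lj", "nj")

def letters(word, not_digraph_words=frozenset()):
    """Split a word into alphabet letters, greedily pairing up digraphs."""
    w = unicodedata.normalize("NFC", word.lower())
    if w in not_digraph_words:   # e.g. "konjugacija": here n + j are two letters
        return list(w)
    out, i = [], 0
    while i < len(w):
        if w[i:i + 2] in DIGRAPHS:
            out.append(w[i:i + 2])
            i += 2
        else:
            out.append(w[i])
            i += 1
    return out

def sort_key(word, not_digraph_words=frozenset()):
    """Rank each letter by alphabet position; unknown symbols sort last."""
    return [(RANK.get(l, len(ALPHABET)), l) for l in letters(word, not_digraph_words)]

words = ["ljubav", "luk", "med", "lav", "ljiljan", "list"]
print(sorted(words))                 # code point order puts lj... before luk
print(sorted(words, key=sort_key))   # alphabet order puts lj... after every l...
```

The exception set is the ugly part: nothing in the string itself says whether the "nj" in konjugacija is one letter or two, so that knowledge has to come from a word list.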

> It's less pronounced with diacritics, but enter Unicode normal forms: you can represent š either as š, or s followed by a diacritic. When you want to compare two strings, you have to normalize them to ensure you are comparing apples to apples. I can guarantee most software is broken in that regard. For Cyrillic, it just works.

It's the same with the Unicode encoding of Cyrillic letters: й (U+0439) can also be written as и (U+0438) followed by the combining breve ◌̆ (U+0306).
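A quick way to see which letters are affected is to check whether canonical decomposition (NFD) changes them. This is a small sketch with a made-up helper name, `decomposes`, and the letter lists are only samples:

```python
import unicodedata

def decomposes(ch: str) -> bool:
    """True if the character has a canonical decomposition (NFD changes it)."""
    return unicodedata.normalize("NFD", ch) != ch

# Cyrillic letters with a precomposed and a combining-mark spelling:
# й ё ї ў ѓ ќ
WITH_MARKS = "\u0439\u0451\u0457\u045e\u0453\u045c"

# The 30 letters of the Serbian Cyrillic alphabet.
SERBIAN_CYRILLIC = "абвгдђежзијклљмнњопрстћуфхцчџш"

print([ch for ch in WITH_MARKS if decomposes(ch)])         # all six decompose
print([ch for ch in SERBIAN_CYRILLIC if decomposes(ch)])   # empty list
```

So the normalization issue does exist in Cyrillic in general (Russian й and ё, Ukrainian ї, Macedonian ѓ and ќ), while the Serbian alphabet happens to contain no letter with a second spelling.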

> Interestingly — and not many know this — Unicode includes separate codepoints for all of the digraphs too. While well-intentioned, it only makes the problem worse.

Based on your description, it seems that the root cause of the issues is using two letters to represent the digraph - for example N (U+004E) J (U+004A) instead of Ǌ (U+01CA) - and the sorting issues would be identical if people typed Н (U+041D) Ь (U+042C) instead of Њ (U+040A).

What's the reason for the digraph being substituted by 2 letters in the first case more often than in the second case?

You are absolutely right that there are examples where Cyrillic as used by Slavic languages is not perfectly "clean" either, and it's certainly a lot more nuanced than my simplistic and absolutist claim.

Perhaps people misunderstood me: it is not a _technical_ property of Cyrillic (vs Latin) script per se, but a combination of historical setting and ability to adapt the script to the (smaller) group's language. This has led to Cyrillic scripts being _developed_ to be technically more suitable for Slavic languages, because where Latin script was used, there was not as much liberty (perceived or real).

I mean, either is just a set of pictograms representing parts of spoken words, and obviously, if developed similarly, there is no difference between them. But for Slavic languages they were _not_ developed similarly, which is my point.

So, it's not "trivially provable that Cyrillic is better suited to Slavic languages", but rather that "the symbol representation we settled on in software has some difficulty disambiguating some, but not all, cases of symbol use in a language, a problem that is not unique to Slavic languages: see Dutch IJ, Turkish ı/i, German ß, etc."
Decoupling choice of script from "symbols representation" is a weird approach — this is how people type them out.

Yes, problems are not unique to Slavic languages, but at least for _some_ Slavic languages, Cyrillic has been taken to the most simplified form that is _accidentally_ easy to process on a computer too.

But yes, I was a bit too absolutist, I agree. As ever, everything is more nuanced, so perhaps not "trivially provable", but "closer to full differentiation in graphical representation while being simple and unambiguous to process on a computer"?

Most Latin-based scripts are just as unambiguous ;)
But the point was whether this holds true for Slavic languages: my claim was that it does not, as supported by the discussion.