| HN Mirror



	by tkot 82 days ago
	> it really is also a tool to best codify spoken language of the Slavs (in a sense, it is trivially provable that Cyrillic script is better adapted even to languages which do not use it today, but have to resort to digraphs or glyphs with diacritics - some are thus not using it to distance from a particular influence instead)

	I've heard this claim many times but never the reasoning behind it - by what metric is "ш" superior to "š" and so on?

1 comments

necovek 82 days ago

It's less pronounced with diacritics, but enter Unicode normal forms: you can represent š either as š, or s followed by a diacritic. When you want to compare two strings, you have to normalize them to ensure you are comparing apples to apples. I can guarantee most software is broken in that regard. For Cyrillic, it just works.
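A minimal sketch of the comparison problem described above, using Python's standard `unicodedata` module. The helper name `same_text` is made up for illustration, and NFC is just one reasonable choice of normal form:

```python
import unicodedata

precomposed = "\u0161"   # š as a single code point (LATIN SMALL LETTER S WITH CARON)
decomposed = "s\u030c"   # plain s followed by COMBINING CARON

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The two spellings render the same but are different code point sequences.
print(precomposed == decomposed)            # naive comparison
print(same_text(precomposed, decomposed))   # comparison after normalization
```

Software that skips the normalization step treats the two spellings as different strings, which is the breakage being described.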

With digraphs (lj, nj, dž + sometimes dj for đ too), it's even worse. Even capitalization is ambiguous: sometimes it's Lj and other times it's LJ. Then you have words like konjugacija where nj is not a digraph.

Interestingly — and not many know this — Unicode includes separate codepoints for all of the digraphs too. While well-intentioned, it only makes the problem worse.
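For the curious, the digraph code points are easy to poke at in Python. This is only a sketch: the `DIGRAPHS` table and the `fold_digraphs` helper are names invented here, and NFKC is used as one possible way to fold them back to two letters (it also rewrites other compatibility characters, so it is not a digraph-only operation):

```python
import unicodedata

# Single-code-point Latin digraphs: (uppercase, titlecase, lowercase)
DIGRAPHS = {
    "dž": ("\u01c4", "\u01c5", "\u01c6"),
    "lj": ("\u01c7", "\u01c8", "\u01c9"),
    "nj": ("\u01ca", "\u01cb", "\u01cc"),
}

def fold_digraphs(text: str) -> str:
    """Fold compatibility characters, including the digraphs, into plain letters."""
    return unicodedata.normalize("NFKC", text)

single = "\u01c9ubav"   # spelled with the one-code-point lj
plain = "ljubav"        # spelled with l followed by j

# Same word on screen, different strings until folded.
print(single == plain, fold_digraphs(single) == plain)

# Each digraph has three case forms, which is where the Lj vs LJ ambiguity shows up.
for upper, title, lower in DIGRAPHS.values():
    print(unicodedata.name(lower), fold_digraphs(upper), fold_digraphs(title), fold_digraphs(lower))
```

So a text can now contain the same word in at least two encodings, on top of the Lj/LJ casing question.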

Digraphs are especially sucky when you try sorting strings in a phonebook order as LJ comes after L, so you've got ...LI, LK..., LZ, LJA... With exceptions, it is even worse.
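A rough sketch of what a phonebook-order sort key could look like in Python. Everything here is an assumption for illustration: the `ALPHABET` list is the usual 30-letter Latin alphabet order with digraphs counted as single letters, `phonebook_key` is a made-up helper, and input is assumed to be NFC-normalized. It greedily treats every lj, nj and dž as a digraph, so it gets exactly the exceptions mentioned above (konjugacija) wrong:

```python
# Alphabet order with the digraphs dž, lj, nj counted as single letters.
ALPHABET = ["a", "b", "c", "č", "ć", "d", "dž", "đ", "e", "f", "g", "h", "i", "j",
            "k", "l", "lj", "m", "n", "nj", "o", "p", "r", "s", "š", "t", "u", "v",
            "z", "ž"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
DIGRAPHS = ("dž", "lj", "nj")

def phonebook_key(word):
    """Turn a word into a list of letter ranks, reading digraphs greedily."""
    word = word.lower()   # Lj, LJ and lj all become lj
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            key.append(RANK[pair])
            i += 2
        else:
            # Characters outside the alphabet sort after everything else.
            key.append(RANK.get(word[i], len(ALPHABET)))
            i += 1
    return key

names = ["Lukić", "Ljubić", "Lazić", "Lolić"]
print(sorted(names))                      # plain code point order puts Lj between La and Lo
print(sorted(names, key=phonebook_key))   # alphabet order puts Lj after all other L names
```

A real implementation would lean on a proper collation library and a word list for the exceptions rather than a hand-rolled key like this.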


tkot 81 days ago

> It's less pronounced with diacritics, but enter Unicode normal forms: you can represent š either as š, or s followed by a diacritic. When you want to compare two strings, you have to normalize them to ensure you are comparing apples to apples. I can guarantee most software is broken in that regard. For Cyrillic, it just works.

It's the same with Unicode encoding of Cyrillic letters - й (U+0439) can be written as й (и U+0438 + ◌̆ U+0306)
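This can be checked quickly with Python's `unicodedata` (a small sketch; the variable names are arbitrary):

```python
import unicodedata

short_i = "\u0439"   # й as a single code point

# NFD splits й into the base letter и and a combining breve.
decomposed = unicodedata.normalize("NFD", short_i)
print([f"U+{ord(ch):04X}" for ch in decomposed])

# NFC puts the two code points back together.
print(unicodedata.normalize("NFC", "\u0438\u0306") == short_i)

# By contrast, Њ (U+040A) has no decomposition, even under NFKD.
print(unicodedata.normalize("NFKD", "\u040a") == "\u040a")
```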

> Interestingly — and not many know this — Unicode includes separate codepoints for all of the digraphs too. While well-intentioned, it only makes the problem worse.

Based on your description it seems that the root cause of the issues is using two letters to represent the digraph - for example N (U+004E) J (U+004A) instead of Ǌ (U+01CA) - and the sorting issues would be identical if people typed Н (U+041D) Ь (U+042C) instead of Њ (U+040A).

What's the reason for the digraph being substituted by 2 letters in the first case more often than in the second case?


necovek 78 days ago

You are absolutely right that there are examples where Cyrillic as used by Slavic languages is not perfectly "clean" either, and it's certainly a lot more nuanced than my simplistic and absolutist claim.

Perhaps people misunderstood me: it is not a _technical_ property of Cyrillic (vs Latin) script per se, but a combination of historical setting and ability to adapt the script to the (smaller) group's language. This has led to Cyrillic scripts being _developed_ to be technically more suitable for Slavic languages, because where Latin script was used, there was not as much liberty (perceived or real).

I mean, either is just a set of pictograms representing parts of spoken words, and obviously, if developed similarly, there is no difference between them. But for Slavic languages they were _not_ developed similarly, which is my point.


troupo 81 days ago

So, it's not "trivially provable that Cyrillic is better suited to Slavic languages". Rather, "the symbol representation we settled on in software has some difficulty disambiguating some, but not all, cases of symbol use in a language, a problem that is not unique to Slavic languages: see Dutch IJ, Turkish ı/i, German ß, etc."


necovek 78 days ago

Decoupling choice of script from "symbols representation" is a weird approach — this is how people type them out.

Yes, problems are not unique to Slavic languages, but at least for _some_ Slavic languages, Cyrillic has been taken to the most simplified form that is _accidentally_ easy to process on a computer too.

But yes, I was a bit too absolutist, I agree. As ever, everything is more nuanced, so perhaps not "trivially provable", but rather "closer to full differentiation in graphical representation while being simple and unambiguous to process on a computer"?


troupo 78 days ago

Most Latin-based scripts are just as unambiguous ;)


necovek 77 days ago

But the point was whether this holds true for Slavic languages: my claim was that it does not, as supported by the discussion.
