Hacker News
by euazOn 393 days ago
Yes, that’s confusing, and it’s probably hard to find a good balance. Someone speaking Greek or Czech may expect to find their language around E (Ελληνικά) or C (Čeština), but nope, on Wikipedia it’s all the way after Z.
4 comments

The problem may be that you need to set the locale in order to get a certain alphabetization, but setting the locale won't happen until after the language is chosen.

A reasonable approach might be to sort the list of names by using, as the sort keys, the strings projected through a Unicode normalization function (decomposing the characters and dropping the combining marks), followed by folding to upper case. Then Čeština gets mapped to CESTINA and at least appears among the C's.
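
For what it's worth, here is a minimal sketch of that idea in Python, using only the standard `unicodedata` module. The function name `sort_key` and the sample list are mine, purely for illustration. Note that normalization alone is not enough: the sketch decomposes with NFD and then drops the combining marks before upper-casing.

```python
import unicodedata

def sort_key(name):
    """Hypothetical helper: decompose, drop combining marks, fold to upper case."""
    decomposed = unicodedata.normalize("NFD", name)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.upper()

# Illustrative sample, not the real Wikipedia language list.
names = ["Dansk", "English", "Čeština", "Deutsch"]
ordered = sorted(names, key=sort_key)
print(ordered)
```

This only helps Latin-script names: Ελληνικά maps to ΕΛΛΗΝΙΚΑ, which is still Greek and still lands after Z.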

Don’t special characters always go after the Latin alphabet? I think this is pretty common, and fairly expected behaviour. Of course nothing is perfect but I feel like the way Wikipedia handles it is consistent.
Not in the Czech alphabet: a, á, b, c, č, d, ď, e, é, ě, ...

Also, we regard 'Ch' as its own letter. So yeah, try sorting alphabetically. I'll wait.
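
To make the 'Ch' point concrete, here is a rough hand-rolled sketch in Python of a Czech-style sort key. Everything in it (`CZECH_LETTERS`, `czech_key`, the sample words) is my own illustration and not a real collation library. It assumes that č, ř, š, ž and ch are separate primary letters, that the other accented letters (á, ď, é, ě, ...) only break ties against their base letter, and that every "c" followed by "h" is the digraph.

```python
import unicodedata

# Assumed primary letters, in order; "ch" is one letter that follows "h".
CZECH_LETTERS = unicodedata.normalize(
    "NFC", "a b c č d e f g h ch i j k l m n o p q r ř s š t u v w x y z ž"
).split()
RANK = {letter: i for i, letter in enumerate(CZECH_LETTERS)}

def strip_marks(ch):
    """Return the character without its combining marks (e.g. "ů" -> "u")."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def czech_key(word):
    """Hypothetical sort key: primary letter ranks first, accents as tie-breakers."""
    text = unicodedata.normalize("NFC", word).lower()
    primary, secondary = [], []
    i = 0
    while i < len(text):
        if text.startswith("ch", i):      # treat the digraph as a single letter
            letter, i = "ch", i + 2
        else:
            letter, i = text[i], i + 1
        base = strip_marks(letter)
        if letter in RANK:                # a primary letter, including č and ch
            primary.append(RANK[letter])
            secondary.append(0)
        elif base in RANK:                # accented variant of a primary letter
            primary.append(RANK[base])
            secondary.append(1)
        else:                             # anything else goes after the alphabet
            primary.append(len(RANK) + ord(letter))
            secondary.append(0)
    return (tuple(primary), tuple(secondary))

words = ["chata", "cihla", "hrad", "čaj", "cena", "dům", "ibis"]
czech_order = sorted(words, key=czech_key)
codepoint_order = sorted(words)
print(czech_order)
print(codepoint_order)
```

With this key "chata" lands between "hrad" and "ibis", whereas plain code point sorting puts it between "cena" and "cihla" and pushes "čaj" to the very end.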

    perl -MUnicode::Collate::Locale -E'Unicode::Collate::Locale->new(locale => "cs")->sort … … …'
works. Test cases at https://prirucka.ujc.cas.cz/?action=view&id=900
I love some Perl on Monday morning, but how does this work when you don’t know the locale?
Then a system should fall back to DUCET (the Default Unicode Collation Element Table), which produces more or less sensible results across all locales.
Digraphs like Ch are common in a lot of languages. Wikipedia supports that fine on category pages. E.g. https://cs.wikipedia.org/wiki/Kategorie:CHKO_%C5%A0umava

If you want to see bizarre sort rules, look up how French sorts accented characters.

> If you want to see bizarre sort rules, look up how French sorts accented characters.

I tried to do this, but there do not appear to be any sources addressing this question.

I did find a French Stack Exchange question asking for this exact information, and complaining that there are no sources (other than an uncited Wikipedia page) that address it. There is no answer posted, but there is a comment from a French guy suggesting that there are no official rules.

https://french.stackexchange.com/questions/54217/french-dict...

How were you imagining I would look this up?

Here is a blog post talking about it https://archives.miloush.net/michkap/archive/2004/12/31/3447...

Or a more technical version at https://www.unicode.org/reports/tr10/#Backward

Another case that is kind of weird is Thai: https://www.unicode.org/reports/tr10/#Rearrangement

> Here is a blog post talking about it

I notice that post suggests that the Académie française specifies that accents should be sorted in reverse, and includes a link over the words "Académie française", and yet that link doesn't go to a supporting document.

A while ago I complained on this forum that Amazon's hyphenation for Kindle ebooks is abysmally bad. (Which is still true.) Someone responded to say that the hyphenation algorithm for English requires this. I pointed out that the hyphenation algorithm for English is a lookup table; each word has its hyphenation defined in the table, and when you need to hyphenate a word, you look up the hyphenation points.

Another response linked me to a paper describing how this table can be stored as a set of rules that provide hyphenation points in arbitrary letter sequences rather than dictionary words. That paper is very clear about its goals; it is an advance in data compression, proposing a method of storing a lookup table that takes less space than the table does. It carefully goes over how to produce the ruleset from the table.

But somewhere along the line, people confused the data compression algorithm (of storing the lookup table as a ruleset) for the hyphenation algorithm. They will now tell you with a straight face that a single ruleset that seems to have gone around represents the hyphenation algorithm for English, even if the word you want to hyphenate wasn't in the table that that ruleset was prepared from. And this is false.

It looks to me like something similar has happened in English speakers' understanding of French sorting order. It's very easy to explain why the example quadruplet has the sorting order it does:

    cote
    côte
    coté
    côté
(Note that the Stack Exchange question from 2024 and the blog post from 2004 use exactly the same example.)

These four words have two pronunciations, and the pronunciations are grouped with each other. After that, "cote" comes first by virtue of bearing no accents, and "o" comes before "ô" for the same reason.

What's happening here is that although French generally pretends that "e" and "é" are the same letter, they aren't, which forces -e (not pronounced) to come before -é (pronounced!). "o" and "ô" actually are the same letter, and can be ordered flexibly.

The rule "sort the accents in reverse" arises as a coincidence; it happens to be the case that this distinction is most significant at the end of French words. But French speakers would reject this ordering:

    cetot
    cétot
    cetôt
    cétôt
This doesn't come up because those words don't exist.
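
For anyone who wants to poke at the "accents in reverse" rule, here is a small Python sketch of it under my own simplifying assumptions: base letters are compared first, and ties are broken by the accent marks, read either left to right or right to left. The helper names (`split_accents`, `forward_key`, `backward_key`) are made up for this example; it is not how any real collation library implements the rule.

```python
import unicodedata

def split_accents(word):
    """Return (base letters, tuple of accent marks attached to each letter)."""
    bases, marks = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and bases:
            marks[-1] += ch          # attach the mark to the preceding letter
        else:
            bases.append(ch)
            marks.append("")
    return "".join(bases), tuple(marks)

def forward_key(word):
    """Ties on base letters are broken by accents read left to right."""
    base, marks = split_accents(word)
    return (base, marks)

def backward_key(word):
    """Ties on base letters are broken by accents read right to left."""
    base, marks = split_accents(word)
    return (base, marks[::-1])

real_words = ["côté", "coté", "côte", "cote"]
made_up_words = ["cétôt", "cetôt", "cétot", "cetot"]

print(sorted(real_words, key=forward_key))
print(sorted(real_words, key=backward_key))
print(sorted(made_up_words, key=backward_key))
```

The backward key reproduces the cote, côte, coté, côté order, and applied mechanically to the made-up words it gives exactly the cetot, cétot, cetôt, cétôt order shown above.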
Well, in my language "é" is absolutely not special, and should definitely be placed near "e" (to the point that uppercase é is often written E instead of É).
It depends on the language. Unicode defines rules for it: https://www.unicode.org/reports/tr10/
If I recall correctly, the default first proposes a short list of the items guessed most likely to be what the user expects, then a more complete list, and in any case lets you filter by typing. I think it can also change the way it behaves if you are logged in and have tweaked the relevant preferences for your account.
Wikipedia uses UCA sort order in categories (depending on which language edition of Wikipedia you are reading). Most other lists just sort using Unicode code point order (in NFC). So it depends, but yes, for generated lists other than categories, ASCII characters usually come first.
That’s English hegemony. Languages have their own sorting that they expect. You can’t impose rules on other languages.

Of course at some point Unicode needs to be ordered, but you don’t get to impose technical details on people around the world because they match how English does it.

That’s where geo-ip guessing becomes relevant. Show a list with the most likely languages at the top.

Or use the Accept-Language header. Since we already know the user understands that language, it's probably a reasonable choice for which sort order they expect too.
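
A rough sketch of the parsing side in Python, in case it helps. The function name `preferred_languages` is hypothetical and the handling is deliberately simplified: it reads only the `q` weights, ignores the `*` wildcard, and does no validation of the language tags.

```python
def preferred_languages(header):
    """Parse an Accept-Language header value into language tags, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        tag = tag.strip()
        params = params.strip()
        quality = 1.0                      # a missing q value means 1.0
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0              # treat a malformed weight as "not wanted"
        if quality > 0 and tag != "*":
            # Sort by descending quality, then by original position.
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]

print(preferred_languages("cs-CZ,cs;q=0.9,en;q=0.8"))
print(preferred_languages("en;q=0.5, el"))
```

The first tag in the result could then be handed to whatever locale-aware collator is available, with a default order as the fallback when the list is empty.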
That’s not English sort order either.
Sorting by character codes, yes.

But in the language's native locale, no.

I guess the default (when no language is specified) is Unicode order:

U+005A LATIN CAPITAL LETTER Z

U+010C LATIN CAPITAL LETTER C WITH CARON

U+0395 GREEK CAPITAL LETTER EPSILON
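
A quick way to see that order is Python's default string comparison, which goes by code point. The sample names below are just an illustration, normalized to NFC so the accented letters are single code points.

```python
import unicodedata

# Illustrative sample; NFC keeps "Č" as the single code point U+010C.
names = [unicodedata.normalize("NFC", n)
         for n in ["Ελληνικά", "Čeština", "Zulu", "English"]]

ordered = sorted(names)  # plain code point order, no locale involved
first_code_points = [f"U+{ord(name[0]):04X}" for name in ordered]

for name, code_point in zip(ordered, first_code_points):
    print(code_point, name)
```

English and Zulu come first, then Čeština, then Ελληνικά, matching the three code points listed above.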

When serving that many languages, a search bar is paramount.