by sampo 849 days ago
I found an online version

https://icu4c-demos.unicode.org/icu-bin/collation.html

but there you first have to select the language from the drop-down menu. So in general, you would need to know the country where your weather station is located before you can correctly collate its name.
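To illustrate why the language matters, here is a toy sketch. The two key functions are hypothetical, heavily simplified stand-ins for Swedish and German dictionary rules (they are not ICU and not real locale data), and the station names are just examples I picked:

```python
# Toy collation keys. Both functions are simplified, hypothetical
# stand-ins for real locale rules; they are NOT what ICU does.

SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def swedish_key(name):
    # Swedish treats å, ä, ö as separate letters that come after z.
    # Characters outside this alphabet are simply skipped in the toy version.
    return [SWEDISH_ALPHABET.index(ch) for ch in name.lower() if ch in SWEDISH_ALPHABET]


# German dictionary-style order sorts umlauts together with their base vowel.
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def german_key(name):
    return name.lower().translate(GERMAN_FOLD)


stations = ["Zürich", "Örebro", "Oslo"]

swedish_order = sorted(stations, key=swedish_key)
german_order = sorted(stations, key=german_key)

print(swedish_order)  # ['Oslo', 'Zürich', 'Örebro']
print(german_order)   # ['Örebro', 'Oslo', 'Zürich']
```

Same three names, two different "correct" orders, depending only on which language's rules you pick.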

I don't believe the fastest entries are doing all this(?)

Edit: In the examples [1], the author writes a Polish city under its English name, "Cracow". So you can't choose the alphabetical ordering based on the geographical location of the weather station; you would have to somehow detect which language each name in the data is written in.

[1] https://www.morling.dev/blog/one-billion-row-challenge/

Edit2: I guess you could declare that either the "Default Unicode Collation Element Table (DUCET)", or perhaps the American English "en-US-u-va-posix" locale is the correct way to alphabetize.
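A minimal sketch of what declaring one fixed ordering buys you. I am assuming here, purely for illustration, that the declared order is plain Unicode code-point order; neither DUCET nor the posix locale is exactly that, so treat this as a stand-in. The station list is made up:

```python
# Assumption for illustration: the agreed ordering is plain Unicode
# code-point order, so no language detection is needed at all.
stations = ["Zürich", "Cracow", "Örebro", "abha"]

# Python compares str values by code point.
by_code_point = sorted(stations)

# Comparing the UTF-8 encoded bytes gives the same order, because
# UTF-8 preserves code-point order under byte-wise comparison.
by_utf8_bytes = sorted(stations, key=lambda s: s.encode("utf-8"))

print(by_code_point)  # ['Cracow', 'Zürich', 'abha', 'Örebro']
print(by_code_point == by_utf8_bytes)  # True
```

With a rule like this, every name is ordered the same way no matter what language it is in, although the result ("Zürich" before "abha") is not what a human would call alphabetical.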

1 comment

You could argue this is dirty data: all the names should be entered in the same language, or each record should have a field that specifies the language so that you can translate. I don't think it's possible to solve the problem perfectly as presented.