Hacker News new | ask | show | jobs
by moonchild 793 days ago
Unicode collation is very complex. If you wanted to sort according to that, then you would need to devise an isomorphism between unicode strings and bitstrings such that lexicographic ordering in the bitstring domain agrees with unicode collation in the unicode domain. I'm not saying it's impossible, but it's not at all obvious how to do it. On the other hand, if all you want is some total order for acceleration structures, then you can indeed just use an arbitrary encoding like utf8 and do the obvious thing.
3 comments

Isn't that exactly what the Unicode collation algorithm is?

https://unicode.org/reports/tr10

tl;dr from its wikipedia page:

> The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which is a customizable method to produce binary keys from strings representing text in any writing system and language that can be represented with Unicode. These keys can then be efficiently compared byte by byte in order to collate or sort them according to the rules of the language, with options for ignoring case, accents, etc

Ah! You are right.
You can just use the icu library to do that. Its a very common thing to do when working with unicode collations.
The obvious solution to this is to pre-compile a list of all possible Unicode strings up to the required length, sort them, and create a lookup table using their indices. I wonder for what kind of project this would be useful.
If that works for you for anything but very small strings, I want to buy your computer.
Even for 3-codepoint strings it already wouldn't fit in any computer that exists. And that is not sufficient to encode all 1-glyph strings.