| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by moonchild 793 days ago
	Unicode collation is very complex. If you wanted to sort according to that, then you would need to devise an isomorphism between unicode strings and bitstrings such that lexicographic ordering in the bitstring domain agrees with unicode collation in the unicode domain. I'm not saying it's impossible, but it's not at all obvious how to do it. On the other hand, if all you want is some total order for acceleration structures, then you can indeed just use an arbitrary encoding like utf8 and do the obvious thing.

3 comments

IncreasePosts 793 days ago

Isn't that exactly what the Unicode collation algorithm is?

https://unicode.org/reports/tr10

tl;dr from its wikipedia page:

> The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which is a customizable method to produce binary keys from strings representing text in any writing system and language that can be represented with Unicode. These keys can then be efficiently compared byte by byte in order to collate or sort them according to the rules of the language, with options for ignoring case, accents, etc

link

moonchild 792 days ago

Ah! You are right.

link

bawolff 793 days ago

You can just use the icu library to do that. Its a very common thing to do when working with unicode collations.

link

dsamarin 793 days ago

The obvious solution to this is to pre-compile a list of all possible Unicode strings up to the required length, sort them, and create a lookup table using their indices. I wonder for what kind of project this would be useful.

link

thfuran 793 days ago

If that works for you for anything but very small strings, I want to buy your computer.

link

remram 792 days ago

Even for 3-codepoint strings it already wouldn't fit in any computer that exists. And that is not sufficient to encode all 1-glyph strings.

link