|
|
|
|
|
by eesmith
1380 days ago
|
|
> which is pretty well suited to radix sort Yes, the general approach is to convert the input data into a fixed-length bit-string with the same sort order as the input. Your example assumes that construction overhead is short. If tie-breaking is rare, and breaking the tie requires an expensive operation, then the trade-off point for radix might be much higher than 100 elements. The fixed-length requirement works well for small items with relatively equal-length fields. Ragged items, like Wikipedia titles, causes a problem. There is one title which is 253 bytes long. Now, Wikipedia titles are limited to 255 bytes, so radix is certainly directly applicable, but 1) it changes the trade-off point, and 2) reduces cache effects. Finally, it requires a sort-order-preserving transformation. I mentioned case-insensitive collation of Unicode strings as a well-known difficult problem. I do not believe there is mapping to an order-preserving representation which can be done bitwise. At the very least, it will be difficult to support all of the collation styles that currently exist (eg, French collation is different than Dutch). |
|
In general MSB radix sort will have to look at the same parts of the input elements as multi-key quick sort, but one hopes that it gets to make fewer passes over the array. A comparator-based sort would look at about the same parts of the input elements as well, but it would look at them many times more than necessary.
> Finally, it requires a sort-order-preserving transformation. I mentioned case-insensitive collation of Unicode strings as a well-known difficult problem. I do not believe there is mapping to an order-preserving representation which can be done bitwise. At the very least, it will be difficult to support all of the collation styles that currently exist (eg, French collation is different than Dutch).
The requirements are a bit underspecified, but I think these can be solved by unicode normalization + tolower + codepoint-wise comparison, which is probably what you'd do in your comparator for a comparison-based sort as well.
[0]: https://github.com/bingmann/parallel-string-sorting/blob/mas...
[1]: https://github.com/dendibakh/perf-challenge6/blob/Solution_R...