|
|
|
|
|
by anonymoushn
1384 days ago
|
|
> timsort and pdqsort (two comparison sort methods) and radix sort (a non-comparative sorting algorithm) have different albeit overlapping domains of applicability. Most comparators are of the form "compare by this, then if tied, compare by that, then if tied, compare by the other thing" which is pretty well suited to radix sort. You are correct though. > While I would reach for a comparison sort method if I had a large number of arbitrary-length Unicode strings, which I wanted to sort in a case-ignoring order. It seems like the radix sort is likely to be a lot faster for this too, mainly due to cache effects. If you have a dataset in mind I'll be happy to give it a shot. > Also, I found timsort faster than radix sort when there was a small number (as I recall, <100 or so) of elements. For sure. 100 isn't so far from the threshold where a radix sort should fall back to something else anyway. |
|
Yes, the general approach is to convert the input data into a fixed-length bit-string with the same sort order as the input.
Your example assumes that construction overhead is short. If tie-breaking is rare, and breaking the tie requires an expensive operation, then the trade-off point for radix might be much higher than 100 elements.
The fixed-length requirement works well for small items with relatively equal-length fields. Ragged items, like Wikipedia titles, causes a problem. There is one title which is 253 bytes long. Now, Wikipedia titles are limited to 255 bytes, so radix is certainly directly applicable, but 1) it changes the trade-off point, and 2) reduces cache effects.
Finally, it requires a sort-order-preserving transformation. I mentioned case-insensitive collation of Unicode strings as a well-known difficult problem. I do not believe there is mapping to an order-preserving representation which can be done bitwise. At the very least, it will be difficult to support all of the collation styles that currently exist (eg, French collation is different than Dutch).