| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mlochbaum 528 days ago

Whoa, hey Tim! Very much agree on no "best" sorting algorithm. The generic-comparison case is one I keep idly considering, although more focused on long lists than short ones. There's been a lot of improvement lately in sorting small datatypes (arguably kicked off by fluxsort), but really, who has so many 4-byte ints laying around that sorting them is a bottleneck? A point Orson Peters made when presenting glidesort[0] was that lists with many repeated duplicates are common even in generic data, as in ordering customers by city, and quicksort seems to be the best way to deal with such a list. However glidesort/driftsort is still largely oriented towards smaller types, and its mergesort-quicksort hybridization can pretty easily make bad choices, as I described at [1] and especially [2] (slightly surprised I didn't mention binary insertion sort in that first section). So this approach feels like it's still immature to me.

[0] https://archive.fosdem.org/2023/schedule/event/rust_glidesor...

[1] https://mlochbaum.github.io/BQN/implementation/primitive/sor...

[2] https://mlochbaum.github.io/BQN/implementation/primitive/sor...

1 comments

tim-peters 528 days ago

You're much more up to date on the details of recent developments than I am. It's faded into a background interest (although a persistent one) for me.

One thing that wasn't clear to me about JesseSort: in what way(z) are Rainbows believed to be an improvement over the simpler scheme used by the older "patience sorting"? They both maintain a collection of sorted sublists merged at the end, both use binary search to find the right sublist to add "the next" array element to, and both excel at "low diversity" inputs (if there are K distinct values, patience sorting will build at most K sublists).

As I recall, full-blown patience sorting isn't "naturally stable". Searching through the JesseSort paper, I didn't find it addressed, and the details were too fuzzy to guess offhand. While this is in no way a principled objection, it just _seemed_ "too complicated" to me. Then again, my intuition for such things has served me well before ;-)

link

mlochbaum 528 days ago

I'd forgotten about patience sorting! Not that I was ever exactly familiar with it. JesseSort seems nearly identical, except that it allows insertion to either side of the lists, tries the previous index before doing a binary search, and uses pairwise instead of k-way merge at the end. The expected Θ(√n) list count is the same as well. An obvious advantage of going double-ended is that you get automatic O(n) sorting if the input's an ascending or descending run, instead of just one of those.

I think both algorithms can be made stable; if there's a barrier for patience sort it would be that the merge phase pops off the end of the sorted lists so it's going in the opposite direction as insertion? For JesseSort, an element that's equal to a previous one can only be inserted to the same list, or a more central one (created later). So you need to merge sublists giving priority to the outer ones, and also never append an element to the bottom of a sublist if it's equal to that list's current minimum. Doesn't look like JesseSort as implemented in jessesort_c.pyx does either of these: it merges non-adjacent lists and has a backwards merge by virtue of using < instead of <=, and always inserts to the innermost list possible instead of being strict about the bottom side. Which is understandable as being strict has a major downside. If you get a string of equal values in the bottom half of the range, you'd be stuck putting each one in a new sublist until you get to the midpoint and can create a new one where these values start going to the top.

link

tim-peters 528 days ago

Ya, I'm just too old for bottomless pits anymore ;-) Patience sorting is still worth study, for its elegance, simple code, and real-world practicality in the context of solving the longest increasing subsequence problem. As a full-blown sorting algorithm, it only excels at reverse-ordered and "low diversity" inputs, but that's in return for very simple, uniform, and easily understood code.

Of course any of these could be changed to use the powersort merge strategy, if desired. It's elegant and principled. But it also exploits that timsort and powersort keep the _number_ of runx simultaneously active no more than log2(N). Needing to keep track of potentially O(N) sublists simultaneously is potentially burdensome.

In any case, I'm most interested in adaptive sorting, since partial order so very often exists in the real world (as you noted that Orson Peters noted, this is often due to "low diversity" as a consequence of sorting on one field of a complex record - but it's more than just that). Thanks to your good explanations, I'm persuaded that JesseSort isn't much more promising than patience sorting in that specific respect (yes, looks like JesseSort can handle ascending order in linear time too). So thanks again!

link