Hacker News new | ask | show | jobs
by mlochbaum 483 days ago
The analysis would make a lot more sense if it dropped every big O and computed an expected number of comparisons. Of course, the real issue is that the whole thing relies on an empirically-determined length of √8√n for the number of lists in the rainbow, when the theoretical justification isn't there and clearly the worst case is n/2. I'm not seeing any awareness that the expected number of comparisons has to be at least log_2(n!), and am starting to see some worrying signs that the author thinks averages like n log log n are possible. I'd initially assumed that "somewhere between O(n) and O(n log_2 n)" statement meant in the presence of asymptotically large natural runs, but there's a claim that the runs "need not be of any minimum size" and another that the algorithm can "capitalize on natural runs" that seems to be referring to the benchmarks on "purely random values".
1 comments

Ah, the README prior to commit b98f724 does claim O(n log(1.5 ln(n))) asymptotic runtime. This is based on a flawed analysis assuming rainbow endpoints will be evenly distributed, which the paper presents before describing the problem that each inserted element pushes one of those endpoints away from the middle. Bit of a perpetual motion machine search: initial results showed the promise of this approach, some details got in the way but surely there's some trick to work around them...

Still there's the benchmark. I remembered timsort.txt shows comparison counts relative to the log_2(n!) bound and it's only like 1% worse, so the >30% improvement over Timsort at the high end has to come from some other factor. I'm thinking it is actually cache effects—not because the algorithm has inherently better cache properties, as it's just a merge sort at large scales, but because it copies the array to ints internally rather than using a generic comparison. That would be a poor result, as good comparison sorts compiled for 4-byte data are many times faster than Timsort.

I want to thank you for your analysis! I'm the "timsort" guy, and I'm asked to look at all sorts of things. The devil is in the details, and I've given up bothering to look unless a paper spells out sufficient details up front. Else it's a pit I'd rather not get sucked into.

As general things, timsort was aimed at exploiting all kinds of pre-existing order, not random lists. The only goal for the latter was to be within reach of CPython's previous highly tuned samplesort implementation (like quicksort on steroids, with much less data movement than mergesorts, but more comparisons).

As you say, for randomly ordered input it gets remarkably close to the information-theoretic lower bound on # of compares. So that's not it.

"Cache effects", maybe, but that's a pit to dig into. I'll note that while the newer "powersort" merge strategy is elegant and provably near-optimal by some relevant measures, in practice it doesn't really appear to run any faster (although cases can be _contrived_ that make it - or the older method! - run faster).

Comparison specialization (to ints) could very well account for it. CPython's general comparison machinery is _very_ expensive, even for what turn out to be native machine ints (which CPython cannot know at compile-time: everything is deduced at runtime).

CPython has since grown new machinery to do a pre-pass over the list, and do a form of cheaper comparison specialization for what turn out to be homogeneous lists of suitable types (including "small enough" ints). That alone gives enough speedup to make some of the original choices sub-optimal. For example, binary insertion sort for short runs is no longer best then. That was aimed at minimizing worst-case # of compares, but as comparisons get cheaper that has less value.

There are also surprises in everything. For example, the worst case for binary insertion sort on most machines was _not_ reverse-ordered input, despite that it requires the most data movement. Instead the worst case was randomly ordered data. Why? Branch prediction. In randomly ordered data, each branch is unpredictable. In reverse-ordered data, "move to the left" is always the test outcome.

Another generality is that Python's sort cares more about speed for shorter lists than huge ones. "Big data" problems are more likely to use, e.g., extensions geared to "big data problems", like numpy for giant arrays of floats. In those contexts more suitable _algorithms_ exist, like radix sorts, or multi-key quicksorts for lists of long strings with long shared prefixes.

Python's niche is more in, e.g., web services, where a great many sorts of shorter lists are common.

Bottom line: there is no one "best" sorting algorithm. I keep an eye out for newer developments, but haven't seen anything better yet for what Python is aiming at. I was, e.g., very impressed by pdqsort - but it's not a stable sort, and that alone makes it a non-starter for general Python use.

Whoa, hey Tim! Very much agree on no "best" sorting algorithm. The generic-comparison case is one I keep idly considering, although more focused on long lists than short ones. There's been a lot of improvement lately in sorting small datatypes (arguably kicked off by fluxsort), but really, who has so many 4-byte ints laying around that sorting them is a bottleneck? A point Orson Peters made when presenting glidesort[0] was that lists with many repeated duplicates are common even in generic data, as in ordering customers by city, and quicksort seems to be the best way to deal with such a list. However glidesort/driftsort is still largely oriented towards smaller types, and its mergesort-quicksort hybridization can pretty easily make bad choices, as I described at [1] and especially [2] (slightly surprised I didn't mention binary insertion sort in that first section). So this approach feels like it's still immature to me.

[0] https://archive.fosdem.org/2023/schedule/event/rust_glidesor...

[1] https://mlochbaum.github.io/BQN/implementation/primitive/sor...

[2] https://mlochbaum.github.io/BQN/implementation/primitive/sor...

You're much more up to date on the details of recent developments than I am. It's faded into a background interest (although a persistent one) for me.

One thing that wasn't clear to me about JesseSort: in what way(z) are Rainbows believed to be an improvement over the simpler scheme used by the older "patience sorting"? They both maintain a collection of sorted sublists merged at the end, both use binary search to find the right sublist to add "the next" array element to, and both excel at "low diversity" inputs (if there are K distinct values, patience sorting will build at most K sublists).

As I recall, full-blown patience sorting isn't "naturally stable". Searching through the JesseSort paper, I didn't find it addressed, and the details were too fuzzy to guess offhand. While this is in no way a principled objection, it just _seemed_ "too complicated" to me. Then again, my intuition for such things has served me well before ;-)

I'd forgotten about patience sorting! Not that I was ever exactly familiar with it. JesseSort seems nearly identical, except that it allows insertion to either side of the lists, tries the previous index before doing a binary search, and uses pairwise instead of k-way merge at the end. The expected Θ(√n) list count is the same as well. An obvious advantage of going double-ended is that you get automatic O(n) sorting if the input's an ascending or descending run, instead of just one of those.

I think both algorithms can be made stable; if there's a barrier for patience sort it would be that the merge phase pops off the end of the sorted lists so it's going in the opposite direction as insertion? For JesseSort, an element that's equal to a previous one can only be inserted to the same list, or a more central one (created later). So you need to merge sublists giving priority to the outer ones, and also never append an element to the bottom of a sublist if it's equal to that list's current minimum. Doesn't look like JesseSort as implemented in jessesort_c.pyx does either of these: it merges non-adjacent lists and has a backwards merge by virtue of using < instead of <=, and always inserts to the innermost list possible instead of being strict about the bottom side. Which is understandable as being strict has a major downside. If you get a string of equal values in the bottom half of the range, you'd be stuck putting each one in a new sublist until you get to the midpoint and can create a new one where these values start going to the top.

Ya, I'm just too old for bottomless pits anymore ;-) Patience sorting is still worth study, for its elegance, simple code, and real-world practicality in the context of solving the longest increasing subsequence problem. As a full-blown sorting algorithm, it only excels at reverse-ordered and "low diversity" inputs, but that's in return for very simple, uniform, and easily understood code.

Of course any of these could be changed to use the powersort merge strategy, if desired. It's elegant and principled. But it also exploits that timsort and powersort keep the _number_ of runx simultaneously active no more than log2(N). Needing to keep track of potentially O(N) sublists simultaneously is potentially burdensome.

In any case, I'm most interested in adaptive sorting, since partial order so very often exists in the real world (as you noted that Orson Peters noted, this is often due to "low diversity" as a consequence of sorting on one field of a complex record - but it's more than just that). Thanks to your good explanations, I'm persuaded that JesseSort isn't much more promising than patience sorting in that specific respect (yes, looks like JesseSort can handle ascending order in linear time too). So thanks again!

>This is based on a flawed analysis assuming rainbow endpoints will be evenly distributed

I guess you know this, but we don't need to do any complicated analysis to disprove such a big-O for a comparison-based sort. It's a well known information-theory result: each comparison can remove at most O(1) entropy, while the initial state has O(n lg n) entropy because that's how log(N!) grows (https://en.wikipedia.org/wiki/Stirling%27s_approximation).