| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by scandum 1227 days ago

If the last presentation's benchmark still goes, blitsort is significantly faster.

Keep in mind that glidesort is partly derived from fluxsort and many of its techniques were directly inspired by fluxsort/quadsort, while several others are implemented in a similar or near-identical manner. This is in my opinion not adequately explained in the presentation on youtube.

The main difference is that glidesort uses timsort's approach to adaptive sorting rather than quadsort's, though it uses quadsort's novel method for cost-effective branchless merging.

Glidesort tries to access 4 memory regions instead of 2, primarily with the bidirectional partition, an approach I implemented in 2021, but rejected because it was causing degraded performance for expensive comparisons. This does however result in a very different visualization which can give the impression it is unrelated. The general approach to stable partitioning is likely nearly identical to fluxsort.

As for why blitsort is fast, it's likely due to trinity rotations and monobound binary searches. Possibly also a more optimal overall partitioning strategy.

I haven't spend too much time on blitsort however, so there are likely further improvements to be made.

1 comments

orlp 1227 days ago

> Keep in mind that glidesort is partly derived from fluxsort and many of its techniques were directly inspired by fluxsort/quadsort, while several others are implemented in a similar or near-identical manner. This is in my opinion not adequately explained in the presentation on youtube.

We've had an extensive email discussion on this, and I still have to publicly provide counterweight.

Yes, I did look at fluxsort/quadsort (mostly the README descriptions) while developing glidesort, I did get inspired by fluxsort's stable quicksort performance. And I did take quadsorts 'parity merge' idea for small sorts, as I described here 9 months ago: https://news.ycombinator.com/item?id=31101498. I'm open about this.

But you're not the first to do branchless merging, I was already aware of that in "Branch Mispredictions Don't Affect Mergesort". Ping-pong merges turns out were already described in "Patience is a Virtue: Revisiting Merge and Sort on Modern Processors".

The observation that the reason the 'parity merge' was so fast being instruction-level and memory-level parallelism is my own (you added this to the README after I publicly described this in the above linked comment). That we can then use it more generally as a bidirectional merge is my own. Using merge splitting and lazy merging to then apply this to more merges is my own. Interleaving multiple independent merge loops is my own. Extending the quicksort to then also use bidirectional partitioning is also my own idea (I don't have a crystal ball about you implementing and rejecting the idea in 2021). My pivot selection is my own. The whole extension of powersort to use lazy merging and more merge splitting for ping-pong merge opportunities is my own. The novel robustly adaptive combination of merge and quicksort is my own. I've spent months and months optimizing glidesort. Glidesort tackles problems fluxsort/quadsort don't even have to worry about in being unable to copy values, only move them, and dealing with panics from user-provided comparator functions, and comes out with minimal losses.

You do great work Igor, but I would appreciate it if you would stop try to claim mine as (mostly) yours.

link

scandum 1227 days ago

This is what quadsort's README stated in early 2021:

"Since the parity merge can be unrolled it's very suitable for branchless optimizations to speed up the sorting of random data. Another advantage is that two separate memory regions can be accessed in the same loop with no additional overhead. This makes the routine up to 2.5 times faster on random data."

I described what made it fast and it was obvious there was general applicability. If I understanding your reasoning correctly, because you took inspiration from my more advanced unguarded two directional merge, you feel like a guarded two directional merge is some new novel discovery.

This is a very contentious line of reasoning.

Subsequently you then proceed to give me zero credit for branchless merging in your Youtube presentation, instead referencing a branchless merge that gave a 10-20% speed-up instead of quadsort's 150% speed-up. You also claimed in the presentation there were no bidirectional branchless algorithms written to date, this is simply not true.

This does upset me. It would be a simple matter, and the right thing, to point out that quadsort introduced bidirectional branchless merging.

As for your work to increase the memory regions from 2 to 4. This is easily conceived, but indeed quite challenging to actually implement.

link

orlp 1226 days ago

I will retract that you didn't mention memory-level parallelism back then, I lazily looked at the git blame and assumed it was introduced then. That was sloppy of me, sorry.

> I described what made it fast

I still partially disagree. Memory-level parallelism is one aspect but I think instruction-level parallelism and hiding data dependencies from iteration to iteration are the bigger factors. That's mainly what I chose to fully exploit in glidesort by actively splitting merges into smaller merges, and implementing bidirectional quicksort. That was the main takeaway from my glidesort talk.

> and it was obvious there was general applicability. If I understanding your reasoning correctly, because you took inspiration from my more advanced unguarded two directional merge, you feel like a guarded two directional merge is some new novel discovery.

The novelty is pointing out that also the hiding of data dependencies and instruction-level parallelism is making it fast, and to then go on and exploit this to the maximum. E.g. if I turn off the merge splitting in glidesort (reducing the amount of parallel merges) performance degrades 17.5% on my machine.

Your README explicitly called them parity merges, and that the arrays have to be of equal length. You might find the extension to unbalanced merges and using it for all merges instead of only the smallsort obvious, I disagree. If it is obvious... why didn't you do it? Correct me if I'm wrong but I was under the impression quadsort only uses the bidirectional merges for the small base case where arrays have statically known (equal) sizes.

> Subsequently you then proceed to give me zero credit for branchless merging in your Youtube presentation, instead referencing a branchless merge that gave a 10-20% speed-up instead of quadsort's 150% speed-up.

Because I was already aware of it, from the earlier work. That they implemented it poorly is besides the point. You and I both know that a bad implementation of a good idea in sorting can easily destroy all performance gains.

The part that was novel to me from your work on quadsort was eliminating the bounds check on perfectly balanced merges and the original idea of merging from both ends. That I credit you for.

> You also claimed in the presentation there were no bidirectional branchless algorithms written to date, this is simply not true. > > This does upset me. It would be a simple matter, and the right thing, to point out that quadsort introduced bidirectional branchless merging.

I don't think I claimed that. And I have modified the slides now to explicitly mention quadsort on the slide for bidirectional merging.

link

scandum 1226 days ago

As I wrote previously, thank you for fully addressing all my concerns in your recent FOSDEM presentation.

While it would have been outside the scope of the presentation, and time being short, quadsort does present an interesting alternative to handling the merge length problem powsort tries to solve.

I've honestly been unable to detect any notable instruction-level parallelism on my own hardware. I suspect there may not be any? It'll be interesting to get to the bottom of this.

As for quadsort, it does contain a guarded bidirectional merge. I published it after you started working on glidesort however. It was always on my todo list, but I do get tired of programming from time to time.

As for unguarded parity merges, it was indeed one of my brighter moments when I came up with that.

Anyways, once again, thanks for addressing my concerns in the FOSDEM presentation, and if it wasn't clear, I think you did some really excellent work on glidesort.

link