| This is what quadsort's README stated in early 2021: "Since the parity merge can be unrolled it's very suitable for branchless optimizations to speed up the sorting of random data. Another advantage is that two separate memory regions can be accessed in the same loop with no additional overhead. This makes the routine up to 2.5 times faster on random data." I described what made it fast and it was obvious there was general applicability. If I understanding your reasoning correctly, because you took inspiration from my more advanced unguarded two directional merge, you feel like a guarded two directional merge is some new novel discovery. This is a very contentious line of reasoning. Subsequently you then proceed to give me zero credit for branchless merging in your Youtube presentation, instead referencing a branchless merge that gave a 10-20% speed-up instead of quadsort's 150% speed-up. You also claimed in the presentation there were no bidirectional branchless algorithms written to date, this is simply not true. This does upset me. It would be a simple matter, and the right thing, to point out that quadsort introduced bidirectional branchless merging. As for your work to increase the memory regions from 2 to 4. This is easily conceived, but indeed quite challenging to actually implement. |
> I described what made it fast
I still partially disagree. Memory-level parallelism is one aspect but I think instruction-level parallelism and hiding data dependencies from iteration to iteration are the bigger factors. That's mainly what I chose to fully exploit in glidesort by actively splitting merges into smaller merges, and implementing bidirectional quicksort. That was the main takeaway from my glidesort talk.
> and it was obvious there was general applicability. If I understanding your reasoning correctly, because you took inspiration from my more advanced unguarded two directional merge, you feel like a guarded two directional merge is some new novel discovery.
The novelty is pointing out that also the hiding of data dependencies and instruction-level parallelism is making it fast, and to then go on and exploit this to the maximum. E.g. if I turn off the merge splitting in glidesort (reducing the amount of parallel merges) performance degrades 17.5% on my machine.
Your README explicitly called them parity merges, and that the arrays have to be of equal length. You might find the extension to unbalanced merges and using it for all merges instead of only the smallsort obvious, I disagree. If it is obvious... why didn't you do it? Correct me if I'm wrong but I was under the impression quadsort only uses the bidirectional merges for the small base case where arrays have statically known (equal) sizes.
> Subsequently you then proceed to give me zero credit for branchless merging in your Youtube presentation, instead referencing a branchless merge that gave a 10-20% speed-up instead of quadsort's 150% speed-up.
Because I was already aware of it, from the earlier work. That they implemented it poorly is besides the point. You and I both know that a bad implementation of a good idea in sorting can easily destroy all performance gains.
The part that was novel to me from your work on quadsort was eliminating the bounds check on perfectly balanced merges and the original idea of merging from both ends. That I credit you for.
> You also claimed in the presentation there were no bidirectional branchless algorithms written to date, this is simply not true. > > This does upset me. It would be a simple matter, and the right thing, to point out that quadsort introduced bidirectional branchless merging.
I don't think I claimed that. And I have modified the slides now to explicitly mention quadsort on the slide for bidirectional merging.