Hacker News new | ask | show | jobs
by BeeOnRope 1481 days ago
Interesting post and paper, thanks!

Sometimes the state of the art is not found in another paper but somewhere else, e.g,. there is vxsort by Dan Shechter (damageboy):

https://github.com/damageboy/vxsort-cpp/tree/master

He uses a similar approach and while I'm not sure how it compares to the older Blacher et al version, I expect it to be in the ballpark.

He's written an excellent series of blog posts on the design & implementation:

https://bits.houmus.org/2020-01-28/this-goes-to-eleven-pt1

1 comments

Thanks for sharing the pointers. I hadn't come across that, will read soon :)
Interestingly from the portability angle, this was all written in C# but with the SIMD intrinsics there he was still able to obtain ~SOTA performance.

The repo I linked is his C++ version of the C# original.

It was submitted to the C# standard library but was rejected (as an observer, the rejection was disappointing) – had it been accepted C# would have been in the interesting position of having a faster sort than any native language standard library, at least for integer keys.

I've read parts 1-5, there doesn't seem to be a part6. Thanks again for the link, it is regrettable that there is duplication of effort. His approach is quite similar to ours (including HeapSort, sorting network, unrolled in-place partition), and re-invents what he calls "double-pumped partitioning" (introduced by Bramas).

Copying the blog post to arxiv and citing some related work so it comes up in alerts and cited-by searches would [have made/make] this good work visible to a wider circle, including the algorithms community.

It's nice to see a C++ version, this will be easier for me to benchmark, and it also seems to support other built-in types now, as well as AVX2+AVX-512 (though not other platforms). A quick list of differences based on reading the blog:

* Nifty idea for simplifying the branch in partition. (We also found that branchless code is worse)

* Major effort to make all reads vector-aligned. This could make a difference especially on AVX-512. I had tried two variants but it was difficult to make this safe for asan (no out of bounds reads allowed).

* The lookup table is 2KB, which could be halved using 32-bit broadcast + variable shift instead of PDEP.

* The sorting network seems to be half the size and bitonic. I haven't yet compared the number of permute instructions (not obvious from the source because it's generated).

* The pivot is only a median of three. We use a much larger sample.