Hacker News new | ask | show | jobs
by orlp 1481 days ago
SIMD based sorting isn't new, e.g. djbsort: https://sorting.cr.yp.to/. I find it strange the paper doesn't cite it or compare to it at all.

EDIT: to be fair it does seem that Bramas' first published work on SIMD sorting (that I could find) predates djbsort, https://arxiv.org/abs/1704.08579. A comparison would still be nice though.

5 comments

It's not new within Google either. I remember poking through Jeff Dean's original MapReduce code (written c. 2003) and finding a SIMD-based QuickSort that was the in-memory portion of the external sort algorithm.

It looks like this algorithm is interesting in the specifics of how it works, eg. the use of a compress-store instruction for partitioning. The algorithm I mentioned above mostly focused on speeding up comparisons through the use of SIMD instructions, rather than changing the partitioning part of the QuickSort algorithm itself. Not sure how that compares to djbsort, I didn't see technical details of the algorithm on the page.

Interesting, I did not know that, thanks for the pointer. There is also an effort underway to update LLVM std::sort to BlockQuicksort, which can use some SIMD instructions similarly to what you describe.
The blog post does seem to be quite explicit about the fact that their contribution is in the field of portability, not novelty.
nothing happened before google

https://dl.acm.org/doi/10.1145/113379.113380

hah, our paper actually cites "Radix sort for vector multiprocessors" by the same author, also from 1991 :)
What exactly should be cited? I had a quick look and djbsort seems to be a constant-time algorithm, presumably a sorting network. They report about 2 GB/s for 768 int32 on 3 GHz Skylake; IIRC our sorting network reaches 3-4 GB/s, but for a smaller input size (256 int32 or int64). Hard to compare directly, but djbsort is likely slower, and the sorting network is anyway a small fraction of the total time in a Quicksort.
djbsort uses a sorting network, not a quicksort. It is also designed to be in constant time, which is not the case with the linked blog post. So the restrictions are different.

This is not my field, so I can't judge the relevance of this, but the author cites the state of the art. It just seems that djbsort is not the state of the art and therefore does not need to be cited.

I still don't understand the hype around Bernstein, he is quite relevant in cryptography/cryptography implementations but not everything he does is gold.

:) FWIW we also use a constant-time sorting network as the base case of the Quicksort recursion, that's a widely used optimization.
For 32bits, djb has done light testing, indicating djbsort would be a better base case.

> So far I haven't been able to verify these vqsort speed claims. On the contrary, it seems that, for 32-bit data types on AVX2, vqsort would be faster if its base-case code were replaced by a call to the 2018 djbsort code. Similarly, vqsort should reuse vxsort-cpp for AVX-512.

https://twitter.com/hashbreaker/status/1533314687726538753

The article acknowledges SIMD based sorting and states what makes this novel.
The whole "stick SIMD into it and compare it with the stdlib" thing is a really common learning trope. Feels more like a portfolio filler, albeit probably not a useless one.
We also compare with state of the art platform-specific code. The interesting thing here is that they turn out to be slower than our approach using portable intrinsics (github.com/google/highway) :)