I wonder how fast it is compared to djbsort https://github.com/jart/cosmopolitan/blob/master/libc/nexgen... and longsort https://github.com/jart/cosmopolitan/blob/e011973593407f576d... djbsort is outrageously fast for 32-bit ints with avx2 (which, unlike avx512, is the avx we can reliably use in open source). But there's never been a clear instruction set to use on Intel / AMD for sorting 64-bit ints that's reliably fast. So if this thing can actually sort 64-bit integers 10x faster on avx2 I'd be so thrilled.
Yes, we can sort 64-bit ints. The speedup on AVX2 is roughly 2/3 of the 10x we see on AVX-512.
Longsort appears to be an autovectorized sorting network. That's only going to be competitive or even viable for relatively small arrays (thousands). See comments above on djbsort.
Why not use whichever AVX the CPU has? Not a problem when using runtime dispatch :)
The chart above shows a 1000x (3 orders of magnitude base 10) increase in energy consumption relative to a register move (it really should be called copy).
I'm not sure what you mean by that. You can't assume the presence of AVX or AVX2 without explicitly checking for it, because Intel was still disabling those features on new low-end Pentium and Celeron parts at least as recently as Comet Lake (2020). Sure, AVX2 support is much more widespread than AVX512 support, but that has nothing to do with open source, and it's a bit strange to describe that in terms of reliability.
Some of us like to think of ourselves writing open source as serving the public interest. It's hard to do that if you're focusing on an ISA the public doesn't have. I haven't seen any consumer hardware that has AVX512.
Lots of consumer hardware has AVX512 (I have an 11th gen Intel laptop CPU that has it).
Regardless, Clang and GCC both support function multi-versioning, where you supply multiple versions of a function and specify which CPU features each implementation needs, and the best version of the function is selected at runtime based on the results of cpuid. For example, you can use this to write versions of a function that use no vector instructions, SSE, AVX2, or AVX512; all versions will be compiled into the executable, and the best version you can actually use will be selected at runtime. This is how glibc selects the optimal version of functions like memset/memcpy/memcmp, as there are vector instructions that significantly speed these functions up.
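To make that concrete, here is a minimal sketch using the `target_clones` attribute, one of the ways GCC and recent Clang expose multi-versioning. The function `sum_i64` and the `MULTIVERSION` macro are made-up names for illustration, and the guard assumes x86-64 Linux with glibc (the attribute relies on ifunc support there); on anything else the sketch quietly falls back to a single plain build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper macro: enable multi-versioning only where this sketch
   assumes it works (GCC/Clang, x86-64, glibc for ifunc resolution). */
#if defined(__x86_64__) && defined(__GLIBC__) && defined(__GNUC__) && defined(__has_attribute)
#  if __has_attribute(target_clones)
#    define MULTIVERSION __attribute__((target_clones("default", "avx2", "avx512f")))
#  endif
#endif
#ifndef MULTIVERSION
#  define MULTIVERSION /* fallback: one ordinary version of the function */
#endif

/* The compiler emits one clone per listed target plus a resolver that
   picks among them at load time based on the CPU's reported features. */
MULTIVERSION
int64_t sum_i64(const int64_t *v, size_t n) {
    assert(v != NULL || n == 0);
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += v[i];
    }
    return total;
}
```

Callers just call `sum_i64(v, n)` as usual. Whether the AVX2 or AVX-512 clone is actually faster depends on how well the compiler autovectorizes the loop, so treat this as a demonstration of the mechanism, not a benchmark.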
I agree AVX-512 is not exactly widespread on client CPUs but as akelly mentions, it does exist (e.g. Icelake).
What we do is dispatch to the best available instruction set at runtime - that costs only an indirect branch, plus somewhat larger binary and longer compile time.
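A rough sketch of what hand-rolled dispatch through a function pointer can look like (not the actual code of the library being discussed). It assumes GCC or Clang on x86, where `__builtin_cpu_supports` is available; `max_i64`, `max_scalar`, `max_avx2` and `pick_max_impl` are hypothetical names, and the "AVX2" version is just the same loop compiled with AVX2 enabled rather than hand-written intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t (*max_fn)(const int64_t *, size_t);

/* Baseline version, built for whatever the default target is. */
static int64_t max_scalar(const int64_t *v, size_t n) {
    int64_t best = INT64_MIN;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > best) best = v[i];
    }
    return best;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#define HAVE_X86_DISPATCH 1
/* Same loop, but the compiler is allowed to use AVX2 in this one function. */
__attribute__((target("avx2")))
static int64_t max_avx2(const int64_t *v, size_t n) {
    int64_t best = INT64_MIN;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > best) best = v[i];
    }
    return best;
}
#endif

/* Query the CPU once and return the best implementation it can run. */
static max_fn pick_max_impl(void) {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return max_avx2;
#endif
    return max_scalar;
}

/* Public entry point: after the first call, the only overhead is the
   indirect call through `impl`. The lazy init here is not thread-safe;
   a real library would resolve the pointer once at startup. */
int64_t max_i64(const int64_t *v, size_t n) {
    static max_fn impl = NULL;
    assert(v != NULL || n == 0);
    if (impl == NULL) impl = pick_max_impl();
    return impl(v, n);
}
```

The binary carries both versions, which is where the larger size and longer compile time come from; adding an AVX-512 path would mean one more function and one more `__builtin_cpu_supports` check.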
Even if AVX512 was entirely constrained to server hardware (it's not), how would it be contrary to the public interest for open-source software to take advantage of those instructions?
Intel 10th gen mobile and 11th gen mobile and desktop, excluding Pentium and Celeron, have AVX-512. And all 12th gen have it on the P cores but not the E cores. If the E cores are enabled then AVX-512 is unavailable.
On 12th gen, a microcode update disabled it on the P cores too, even with the E cores disabled. A lot of newer systems don't have access to the older microcode, and microcode doesn't typically let you downgrade.
There are workarounds for downgrading microcode, because the CPU itself doesn't actually have non-volatile storage for microcode updates and relies on the motherboard firmware to upload updates on each boot (and motherboard firmware can often be downgraded, possibly after changing a setting to allow that).
Which is probably why Intel has changed to disabling AVX512 using fuses in more recently manufactured Alder Lake CPUs.
My point with "a lot of newer systems" was that there are motherboards now that completely lack a version of their firmware with the microcode that allows avx-512. There's nothing to downgrade to without an exploit to allow making your own firmware images with mixed and matched components.