The E3 v5 series was part of the generation codenamed Skylake, introduced in 2015. But the Skylake microarchitecture was reused in each subsequent Intel desktop processor generation through 2019's Comet Lake (due to Intel's 10nm failure). Intel didn't introduce a new microarchitecture in that product segment until Rocket Lake and Alder Lake, both in 2021. So despite being almost 7 years old, the E3-1220 v5 is still representative of most of the installed base for Intel desktops and entry-level workstations, and a large chunk of Intel's mobile installed base.
(The original E3-1220 predates AVX2 by two years, so this code wouldn't even run on it.)
Oh, thanks for pointing that out. Golly, Sandy Bridge is a bit old, yes - but still the result is surprising.
djb reports 8000 cycles for int32 x 256 - this is much slower than what we measure in bench_sort.cc, even for AVX2 (which he confirms is being reached). Not sure what's going on.
In short, what is being compared is the O(1) djbsort sorting network vs. our full quicksort: pivot sampling, partitioning, then sorting network.
This is because our sorting network size is 16 * elements_per_vector, i.e. 128 in this configuration, so a 256-element input is too large for the network alone and goes through the partitioning path first.
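To make the arithmetic concrete, here is a rough sketch of that size check. The helper names (ElementsPerVector, NetworkSize, NeedsPartitioning) are made up for illustration and are not the actual library code, and I'm assuming 32-byte AVX2 vectors and that inputs up to and including the network size skip partitioning.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many elements of type T fit in one vector
// of `vector_bytes` bytes (e.g. 32 bytes for AVX2).
template <typename T>
constexpr std::size_t ElementsPerVector(std::size_t vector_bytes) {
  return vector_bytes / sizeof(T);
}

// Hypothetical helper: largest input the sorting network handles on its
// own, using the 16 * elements_per_vector rule from the comment above.
template <typename T>
constexpr std::size_t NetworkSize(std::size_t vector_bytes) {
  return 16 * ElementsPerVector<T>(vector_bytes);
}

// Assumption: anything larger than the network size goes through
// pivot sampling and partitioning before reaching the network.
template <typename T>
constexpr bool NeedsPartitioning(std::size_t num_elements,
                                 std::size_t vector_bytes) {
  return num_elements > NetworkSize<T>(vector_bytes);
}

constexpr std::size_t kAvx2VectorBytes = 32;  // 256-bit vectors

// int32 on AVX2: 8 elements per vector, so the network covers 128 elements.
static_assert(ElementsPerVector<std::int32_t>(kAvx2VectorBytes) == 8, "");
static_assert(NetworkSize<std::int32_t>(kAvx2VectorBytes) == 128, "");
// The int32 x 256 benchmark exceeds that, so it takes the quicksort path.
static_assert(NeedsPartitioning<std::int32_t>(256, kAvx2VectorBytes), "");
```

So for int32 x 256 on AVX2, the benchmark is timing the whole quicksort rather than just a fixed-size network.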