In short, it turns out not to help for single-core vectorized sorting, but a few initial passes of ips4o (with 64- to 256-way partitioning) make parallel sorts faster.
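To make "an initial k-way partitioning pass" concrete, here is a minimal sketch of one sample-based distribution pass followed by independent per-bucket sorts. It is a simplified, out-of-place illustration of the samplesort idea, not the actual ips4o implementation (which partitions in place and in parallel). The names `partition_pass` and `sort_with_initial_pass` and the `oversample` parameter are hypothetical, and Python's built-in `sorted` stands in for whatever per-bucket sort would be used in practice.

```python
import bisect
import random


def partition_pass(keys, num_buckets=64, oversample=8, seed=0):
    """One k-way distribution pass; returns buckets in ascending key order.

    Hypothetical helper: simplified and out-of-place, unlike real ips4o.
    """
    if num_buckets < 2 or len(keys) < num_buckets:
        return [list(keys)]  # too small to be worth partitioning

    # Draw a random sample and sort it to estimate the key distribution.
    rng = random.Random(seed)
    sample_size = min(len(keys), num_buckets * oversample)
    sample = sorted(rng.sample(keys, sample_size))

    # Pick num_buckets - 1 evenly spaced splitters from the sorted sample.
    step = sample_size / num_buckets
    splitters = [sample[int(i * step)] for i in range(1, num_buckets)]

    # Classify each key by binary search over the splitters.
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[bisect.bisect_right(splitters, key)].append(key)
    return buckets


def sort_with_initial_pass(keys, num_buckets=64):
    """Partition once, then sort each bucket independently and concatenate."""
    out = []
    for bucket in partition_pass(keys, num_buckets):
        # Buckets cover disjoint key ranges, so each one could be sorted
        # by a separate worker; sorted() is only a stand-in here.
        out.extend(sorted(bucket))
    return out
```

Because every key in bucket `j` is less than or equal to every key in bucket `j + 1`, the buckets can be sorted without any coordination between workers, which is why such a pass is attractive for the parallel case.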