by orlp
932 days ago
SIMD partitioning is definitely a thing, but I think it is ill-suited for Lomuto-style partitioning. Fulcrum partitioning (what crumsort uses) or out-of-place partitioning like glidesort does seems the most amenable to SIMDization, since in both cases the output buffers for < and >= are distinct. Then partitioning is 'simply' a vector comparison, two masked compressing stores (through shuffles or _mm_mask_compressstoreu_epi32) with one of the masks inverted, and counting how many elements were smaller with a popcnt on the comparison mask (on AVX2, _mm256_movemask_epi8 gives that mask with one bit per byte, so four bits per 32-bit element). For an out-of-place partition you can interleave two loops, one going left-to-right and one going right-to-left, to increase instruction-level parallelism.
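A minimal scalar sketch of the steps described above, modelling one 8-lane vector per iteration in plain Python so it runs anywhere. The names `LANES`, `compress` and `partition_out_of_place` are made up for illustration, and this is not how crumsort or glidesort are actually written; it only shows the comparison mask, the two compressing stores with one mask inverted, and the popcount that advances each write cursor. The interleaved left-to-right and right-to-left loops are not modelled here.

```python
LANES = 8  # models one 256-bit vector holding eight 32-bit elements


def compress(block, mask):
    """Model of a compressing store: keep lanes whose mask bit is set, in order."""
    return [x for lane, x in enumerate(block) if (mask >> lane) & 1]


def partition_out_of_place(src, pivot):
    """Split src into (elements < pivot, elements >= pivot), one block at a time."""
    n = len(src)
    out_lt = [0] * n  # distinct output buffer for elements < pivot
    out_ge = [0] * n  # distinct output buffer for elements >= pivot
    n_lt = n_ge = 0   # write cursors into the two buffers

    for start in range(0, n, LANES):
        block = src[start:start + LANES]
        valid = (1 << len(block)) - 1  # handles a short tail block

        # "Vector comparison": one mask bit per lane.
        mask = 0
        for lane, x in enumerate(block):
            mask |= int(x < pivot) << lane
        inverted = ~mask & valid

        # Two compressing stores, one using the inverted mask.
        lt = compress(block, mask)
        ge = compress(block, inverted)
        out_lt[n_lt:n_lt + len(lt)] = lt
        out_ge[n_ge:n_ge + len(ge)] = ge

        # Popcount of each mask tells how far to advance each cursor.
        n_lt += bin(mask).count("1")
        n_ge += bin(inverted).count("1")

    return out_lt[:n_lt], out_ge[:n_ge]
```

For example, `partition_out_of_place([5, 1, 9, 3, 7, 2, 8, 6, 4, 0, 10], 5)` returns `([1, 3, 2, 4, 0], [5, 9, 7, 8, 6, 10])`. In a real implementation the inner loop over lanes would be a single compare instruction and the list slicing would be a masked or shuffled store.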
You run into additional trouble when the right region of already-partitioned elements holds fewer than 4 elements. I think this is solvable but I don't have a good proposal off the top of my head.
Edit: well, one solution is to look at the first 8 elements. If at least 4 belong on the right, you can process these 8 carefully and then process the input forwards without worrying about your loads overlapping. If fewer than 4 belong on the right, you can swap these 8 with the last 8, process those 8 carefully, and process the input backwards without worrying about your loads overlapping.
All of this is what we have to do anyway when targeting AVX2, even when merely filtering (keeping some elements and discarding others) or partitioning into 2 output buffers, because AVX2 has no compressing instructions.
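To illustrate why the missing compress instruction forces this care, here is a hedged scalar model of the usual workaround: a 256-entry table mapping each 8-bit comparison mask to a lane permutation that packs the selected lanes to the front. The names `COMPRESS_TABLE`, `permute` and `compress_via_shuffle` are hypothetical; on AVX2 the `permute` step would roughly correspond to a cross-lane permute such as `_mm256_permutevar8x32_epi32`, but that mapping is my assumption and not something stated in the thread.

```python
LANES = 8  # eight 32-bit lanes, as in one 256-bit vector


def build_compress_table():
    """For each 8-bit mask, a permutation that moves selected lanes to the front."""
    table = []
    for mask in range(1 << LANES):
        kept = [i for i in range(LANES) if (mask >> i) & 1]
        dropped = [i for i in range(LANES) if not (mask >> i) & 1]
        table.append(kept + dropped)  # selected lanes first, the rest after
    return table


COMPRESS_TABLE = build_compress_table()


def permute(vec, idx):
    """Model of a cross-lane permute: result lane i takes vec[idx[i]]."""
    return [vec[j] for j in idx]


def compress_via_shuffle(vec, mask):
    """Return the fully shuffled vector and how many leading lanes are meaningful."""
    shuffled = permute(vec, COMPRESS_TABLE[mask])
    return shuffled, bin(mask).count("1")
```

The point of returning the whole shuffled vector is that the subsequent store still writes all 8 lanes; only the write cursor advances by the popcount. The trailing lanes are junk that a later store is expected to overwrite, which is why overlapping loads and stores near the region boundaries need the careful handling described above.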