Hacker News new | ask | show | jobs
by lifthrasiir 928 days ago
Will it be viable to use SIMD to further optimize the algorithm when the predicate is simple enough? For example, one can imagine that you take v[j] .. v[j+N-1] and somehow weed them into two partitions u[0] .. u[k] and u[k+1] .. u[N-1] (probably in SIMD registers), which get pasted accordingly at v[i] or v[j]. The current assembly indeed looks very tight but a higher throughput at the expense of longer code is a common theme in SIMD, I guess.
1 comments

SIMD partitioning is definitely a thing, but I think it is ill-suited for Lomuto-style partitioning. I think fulcrum (what crumsort uses) or out-of-place partitioning (in both cases the output buffers for < and >= are distinct) like glidesort does is the most amenable to SIMDization.

Then partitioning is 'simply' a vector comparison, two masked compressing stores (through shuffles or _mm_mask_compressstoreu_epi32) with one of the masks inverted, and counting how many elements were smaller with _mm256_movemask_epi8 and a popcnt.

For an out-of-place partition you can interleave the loops of one going left-to-right and one going right-to-left to increase instruction-level parallelism.

You can port the algorithm from TFA pretty straightforwardly. For example with AVX2 and u64, load 4 elements from i and from j, compare the 4 elements swapped leftward with the threshold, use a lookup table of shuffles and compute shuffle(xs, lut[movemask(compare_result)]). store this value on the left and advance the pointer by popcnt(movemask(compare_result)). store the raw value previously loaded on the right and advance the pointer by 4.

You face some additional trouble when the right region of already-partitioned elements is smaller than 4. I think this is solvable but I don't have a good proposal off the top of my head.

Edit: well, one solution is to look at the first 8 elements. If at least 4 belong on the right, you can process these 8 carefully and then process the input forwards without worrying about your loads overlapping. If fewer than 4 belong on the right, you can swap these 8 with the last 8, process those 8 carefully, and process the input backwards without worrying about your loads overlapping.

All of this is what we have to do anyway when targeting AVX2, even if filtering some elements and discarding others or partitioning to 2 output buffers, because there are no compressing instructions.

That's actually fair enough. I'm not certain how well the CPU will like the overlapping SIMD reads/writes on the left hand side.

If you have AVX512 available shuffle(xs, lut[movemask(compare_result)]) could also be two compressions with one reverse shuffle in between.

I don't know either! I've never done something like this. Usually I am filtering to the same buffer or one or two other buffers, so I would never immediately load some stuff I had just stored.