|
|
|
|
|
by lifthrasiir
928 days ago
|
|
Will it be viable to use SIMD to further optimize the algorithm when the predicate is simple enough? For example, one can imagine that you take v[j] .. v[j+N-1] and somehow weed them into two partitions u[0] .. u[k] and u[k+1] .. u[N-1] (probably in SIMD registers), which get pasted accordingly at v[i] or v[j]. The current assembly indeed looks very tight but a higher throughput at the expense of longer code is a common theme in SIMD, I guess. |
|
Then partitioning is 'simply' a vector comparison, two masked compressing stores (through shuffles or _mm_mask_compressstoreu_epi32) with one of the masks inverted, and counting how many elements were smaller with _mm256_movemask_epi8 and a popcnt.
For an out-of-place partition you can interleave the loops of one going left-to-right and one going right-to-left to increase instruction-level parallelism.