| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by ack_complete 507 days ago
	255 is not ideal either because it results in a partial vector at the end of either 15 or 31 elements. 256-V where V is the vector size is better, so 240 for SSE2/NEON or 224 for AVX2. This is still lower than optimal because the compiler will reduce to a uint8. Both SSE2 and NEON support reducing to a wider value by _mm_sad_epu8 and vpadal_u8, respectively. This allows for 255 iterations in the inner loop instead of 15 or 7.

1 comments

nicula 507 days ago

Great observations, thanks!

I wrote the code that you suggested (LMK if I understood your points): https://godbolt.org/z/jW4o3cnh3

And here's the benchmark output, on my machine: https://0x0.st/8SsG.txt (v1 is std::count_if(), v2 is the optimization from my blog post, and v3 is what you suggested).

v2 is faster, but v3 is still quite fast.

link

ack_complete 507 days ago

Yeah, the difference you're seeing is likely due to the inner loop doing so few iterations, in addition to not being unrolled. A hand-rolled version doing 63 iterations of an x4 unrolled loop should be able to saturate the execution core (it'll be bottlenecked by load throughput). But it'll be tough to get the compiler to generate that without intrinsics.

link

meisel 507 days ago

The fastest thing I would think is to process 255 vectors at a time, using an accumulation vector with the k-th byte in the vector representing the # of evens seen for the k-th byte of the vectors in that 255-vector chunk. But, I don’t know how to make autovectorization do that. I could only do that with intrinsics.

link