by gergo_barany
736 days ago
The use case in the article here is a single 8-byte (one size_t) popcount at a time. The comparison in your source is for when you want to compute many popcounts in parallel. The smallest size it compares is 32 bytes, where the scalar popcount instruction beats even the fastest of the other variants by a factor of almost 2x. The AVX2 lookup variant only starts beating it at 256 or 512 bytes, depending on the exact CPU. And even then, that variant is not equivalent to (a parallel version of) the scalar code shown in this article. So it's good that the parallel version exists, but comparing it to this article's problem is really apples and oranges.
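To make the distinction concrete, here is a rough sketch of the two problem shapes. It is not the article's code or the benchmark's code: the function names `popcount_one` and `popcount_many` are made up for illustration, and it assumes GCC or Clang, since `__builtin_popcountll` is a compiler builtin rather than standard C. With `-mpopcnt` (or a suitable `-march`) on x86-64, the builtin typically compiles to the scalar popcount instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Single-word case (the article's problem): one popcount on one
 * 8-byte value. Hypothetical helper name; assumes GCC/Clang. */
static inline int popcount_one(uint64_t x) {
    return __builtin_popcountll(x);
}

/* Bulk case (what the linked comparison measures): total number of
 * set bits across a buffer of n 8-byte words. This is the plain
 * scalar loop; the SIMD variants replace this loop, not popcount_one. */
static uint64_t popcount_many(const uint64_t *words, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += (uint64_t)__builtin_popcountll(words[i]);
    }
    return total;
}
```

The 32-byte case from the comparison corresponds to calling `popcount_many` with `n == 4`. A vectorized replacement only pays off once `n` is large enough to amortize its setup, and it has nothing to offer a lone `popcount_one` call.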