by camel-cdr
162 days ago
> The answer, if it’s not obvious from my tone already:), is 8%.

Not if the data is small and in cache.

> The performant route with AVX-512 would probably include the instruction vpconflictd, but I couldn’t really find any elegant way to use it.

I think the best way to do this is to duplicate sum_r and count 16 times, so each lane has its own separate accumulation buckets and there can't be any conflicts. After the loop, you quickly do a sum reduction across the 16 copies of each bucket.
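A minimal sketch of that idea in plain C, with the 16 lanes emulated by an inner loop so it runs without AVX-512. The names `grouped_sum`, `keys`, `vals` and `NGROUPS` are made up for illustration, and I'm assuming sum_r and count are arrays indexed by a small integer key, which may not match the article's exact layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES   16  /* 32-bit lanes in one 512-bit vector */
#define NGROUPS 8   /* hypothetical number of distinct keys */

/* Adds vals[i] into sum_r[keys[i]] and increments count[keys[i]].
   Every lane gets a private copy of all buckets, so two lanes that
   hold the same key never touch the same memory location. */
static void grouped_sum(const uint32_t *keys, const uint32_t *vals, size_t n,
                        uint32_t sum_r[NGROUPS], uint32_t count[NGROUPS])
{
    /* Layout: the 16 per-lane copies of bucket k are contiguous,
       at index k * LANES + lane. */
    uint32_t lane_sum[NGROUPS * LANES] = {0};
    uint32_t lane_cnt[NGROUPS * LANES] = {0};

    for (size_t i = 0; i < n; i += LANES) {
        /* One iteration of this outer loop stands in for one
           vector-wide gather, add, scatter step. */
        for (size_t lane = 0; lane < LANES && i + lane < n; lane++) {
            uint32_t k = keys[i + lane];
            assert(k < NGROUPS);
            lane_sum[k * LANES + lane] += vals[i + lane];
            lane_cnt[k * LANES + lane] += 1;
        }
    }

    /* After the loop: sum-reduce the 16 copies of each bucket. */
    for (size_t k = 0; k < NGROUPS; k++) {
        uint32_t s = 0, c = 0;
        for (size_t lane = 0; lane < LANES; lane++) {
            s += lane_sum[k * LANES + lane];
            c += lane_cnt[k * LANES + lane];
        }
        sum_r[k] = s;
        count[k] = c;
    }
}
```

In a real AVX-512 version the inner loop would become a gather, an add and a scatter using the indices `k * LANES + lane`. Since the lane number is part of every index, all 16 addresses in a vector are distinct by construction, so vpconflictd isn't needed. The cost is 16x the bucket memory, which only pays off while the buckets still fit in cache.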