by hinkley
1596 days ago
For the single-threaded version, I believe they have a similar problem with

    auto sums = _mm256_set1_ps(0);
    for (; it + 8 < end; it += 8)
        sums = _mm256_add_ps(_mm256_loadu_ps(it), sums);

where each SIMD op is trying to write its result back into the same compact data structure.

In the threaded version https://github.com/unum-cloud/ParallelReductions/blob/fd16d9... they do have a separate slot for each accumulator, but the slots still sit next to each other in one shared vector, which most likely has the issue I described.
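
A rough sketch of one way around that, assuming the issue with the shared vector is neighbouring accumulator slots landing on the same cache line. This is not the repo's code: `PaddedSum` and `padded_parallel_sum` are made-up names, the 64-byte line size is a guess that varies by CPU, and it needs C++17 so that `std::vector` honours the over-alignment.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical slot type: alignas(64) pads each accumulator to its own
// 64-byte block (assumed cache-line size), so slots never share a line.
struct alignas(64) PaddedSum {
    float value = 0.0f;
};

// Hypothetical helper: sums `data` with `threads` workers, each writing
// only to its own padded slot.
float padded_parallel_sum(const std::vector<float>& data, std::size_t threads) {
    if (threads == 0) threads = 1;
    std::vector<PaddedSum> slots(threads);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + threads - 1) / threads;

    for (std::size_t t = 0; t < threads; ++t) {
        workers.emplace_back([&data, &slots, chunk, t] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            float local = 0.0f;  // accumulate locally, touch the slot once
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            slots[t].value = local;
        });
    }
    for (auto& w : workers) w.join();

    // Combine the per-thread partial sums on the calling thread.
    float total = 0.0f;
    for (const auto& s : slots) total += s.value;
    return total;
}
```

Accumulating into a local and storing it once at the end already removes most of the cross-thread write traffic; the padding just covers the remaining stores. Whether either makes a measurable difference would need benchmarking on the actual hardware.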