by hinkley 1596 days ago
For the single-threaded version, I believe they have a similar problem with

    auto sums = _mm256_set1_ps(0);
    for (; it + 8 < end; it += 8)
        sums = _mm256_add_ps(_mm256_loadu_ps(it), sums);
where each SIMD op is trying to overwrite the same compact data structure.

But in the threaded version https://github.com/unum-cloud/ParallelReductions/blob/fd16d9... they have separate slots for each accumulator, but those slots still sit in one shared vector, which most likely has the issue I described.
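
To make that concrete, here's a rough sketch of what I'd try instead. I'm assuming the issue is adjacent accumulator slots landing on the same cache line, and that a cache line is 64 bytes (typical for x86, not guaranteed). `PaddedSum` and `padded_parallel_sum` are names I made up for illustration, not anything from the linked repo, and this needs C++17 so that the vector honors the over-alignment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical slot type: alignas(64) gives every accumulator its own
// cache line. The 64-byte line size is an assumption.
struct alignas(64) PaddedSum {
    float value = 0.0f;
};
static_assert(alignof(PaddedSum) == 64, "each slot should start on its own line");

// Hypothetical helper, not the repository's implementation.
float padded_parallel_sum(const std::vector<float>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<PaddedSum> slots(num_threads);  // one padded slot per thread
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &slots, chunk, t] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            float local = 0.0f;  // accumulate locally, touch the slot only once
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            slots[t].value = local;
        });
    }
    for (auto& w : workers) w.join();

    // Combine the per-thread partial sums on the calling thread.
    float total = 0.0f;
    for (const auto& s : slots) total += s.value;
    return total;
}
```

Summing into a local and writing the slot once at the end would probably be enough on its own. The padding just means that even if someone later accumulates directly into `slots[t].value`, the threads aren't fighting over the same line.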