by ashvardanian 1599 days ago
Yes, that's needed when you have counters in global memory. In that case, instead of just having a vector<double>, you would put each double into a structure aligned to 64-byte addresses. Here all the counters are on the local stack, so that trick unfortunately won't help.
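For anyone who hasn't seen the padding trick, here is a rough sketch of what it could look like when the per-thread counters do live in a shared vector. This is not code from the repository: `padded_counter` and `parallel_sum` are made-up names, and the 64-byte cache line is an assumption that holds on most x86 parts but not everywhere. It needs C++17, so that `std::vector` honors the over-alignment when allocating.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical wrapper: each counter occupies its own cache line, so
// threads updating neighboring slots never write to the same line.
// The 64-byte line size is an assumption, not something queried at runtime.
struct alignas(64) padded_counter {
    double value = 0.0;
};

static_assert(sizeof(padded_counter) == 64, "one counter per assumed cache line");

// Illustrative helper: sums `data` with `thread_count` threads, each
// accumulating into its own padded slot of a shared vector.
double parallel_sum(std::vector<double> const& data, std::size_t thread_count) {
    if (thread_count == 0) thread_count = 1;
    std::vector<padded_counter> partial(thread_count);
    std::vector<std::thread> workers;
    std::size_t const chunk = (data.size() + thread_count - 1) / thread_count;

    for (std::size_t t = 0; t < thread_count; ++t) {
        workers.emplace_back([&data, &partial, chunk, t] {
            std::size_t const first = std::min(t * chunk, data.size());
            std::size_t const last = std::min(first + chunk, data.size());
            for (std::size_t i = first; i < last; ++i)
                partial[t].value += data[i];  // touches only this thread's line
        });
    }
    for (auto& worker : workers) worker.join();

    // Final single-threaded reduction over the per-thread slots.
    double total = 0.0;
    for (auto const& slot : partial) total += slot.value;
    return total;
}
```

With a plain vector<double> of partial sums, eight neighboring slots would fit into one 64-byte line and every += could invalidate that line for the other cores. Accumulating into a local variable and writing to the slot once at the end avoids the problem without any padding.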

For the single-threaded version, I believe they have a similar problem with

    auto sums = _mm256_set1_ps(0);
    for (; it + 8 < end; it += 8)
        sums = _mm256_add_ps(_mm256_loadu_ps(it), sums);
Where each SIMD op is overwriting the same compact data structure.

But in the threaded version https://github.com/unum-cloud/ParallelReductions/blob/fd16d9... they have a separate slot for each accumulator, but the slots still sit next to each other in a shared vector, which most likely has the issue I described.