by hinkley 1597 days ago
Doug Lea, of Java Memory Model and concurrency fame, went pretty far down this rabbit hole. Not only do you use separate counters/queues per thread/core, but you also put empty space around them so that you don't accidentally share cache lines. I don't know what they do now, but at the time some of the data structures in that library used arrays where only every 8th or 16th entry was used, to avoid two cores trying to update the same cache line.

Typically allocating a separate data structure per actor also accomplishes this as a happy accident. If the thread does the allocation, then it has a better chance of being in the right bank as well.
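
To make the "every 8th entry" idea concrete, here is a minimal C++ sketch. It is not the Java library's code; the 64-byte line size and the names `kLineBytes`, `kStride` and `count_with_stride` are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache line size; 64 bytes is common on x86-64 but not universal.
constexpr std::size_t kLineBytes = 64;
// With 8-byte counters, only every 8th array entry is used.
constexpr std::size_t kStride = kLineBytes / sizeof(std::uint64_t);

// Hypothetical helper: each thread bumps its own counter `iterations` times,
// then the used slots are summed.
std::uint64_t count_with_stride(std::size_t threads, std::uint64_t iterations) {
    // One used slot per thread, followed by kStride - 1 unused entries.
    std::vector<std::uint64_t> slots(threads * kStride, 0);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threads; ++t)
        workers.emplace_back([&slots, t, iterations] {
            std::uint64_t& mine = slots[t * kStride];
            for (std::uint64_t i = 0; i < iterations; ++i)
                ++mine;
        });
    for (auto& w : workers)
        w.join();
    std::uint64_t total = 0;
    for (std::size_t t = 0; t < threads; ++t)
        total += slots[t * kStride];
    return total;
}
```

Used slots are 64 bytes apart, so no two of them can land on the same 64-byte line even if the vector's buffer does not itself start on a line boundary. An optimizing compiler may keep `mine` in a register for the whole loop, in which case the padding makes little difference in this toy version.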

1 comments

Yes, that's needed when you have counters in global memory. In that case, instead of just having vector<double> you would put each double into a structure aligned to a 64-byte boundary. Here all the counters are on the local stack, so that trick unfortunately won't help.
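
For reference, a rough sketch of that wrapper, assuming a 64-byte cache line. `PaddedDouble`, `kCacheLine` and `neighbour_distance` are made-up names, not anything from the benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed cache line size; 64 bytes is typical for x86-64, not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Hypothetical wrapper: one double per cache line.
struct alignas(kCacheLine) PaddedDouble {
    double value = 0.0;
};

// alignas rounds the size up to a multiple of the alignment.
static_assert(sizeof(PaddedDouble) == kCacheLine, "padded to a full line");
static_assert(alignof(PaddedDouble) == kCacheLine, "line-aligned type");

// Byte distance between the first two counters of a vector (needs size >= 2).
std::size_t neighbour_distance(const std::vector<PaddedDouble>& v) {
    assert(v.size() >= 2);
    auto a = reinterpret_cast<std::uintptr_t>(&v[0]);
    auto b = reinterpret_cast<std::uintptr_t>(&v[1]);
    return static_cast<std::size_t>(b - a);
}
```

With std::vector<PaddedDouble> counters(4), neighbouring elements sit 64 bytes apart. Note that the vector's buffer is only guaranteed to start on a 64-byte boundary from C++17 on, where the default allocator honors over-aligned types. C++17 also offers std::hardware_destructive_interference_size in <new> as an alternative to the hard-coded 64, where the standard library provides it.
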
For the single-threaded version, I believe they have a similar problem with

    auto sums = _mm256_set1_ps(0);
    for (; it + 8 < end; it += 8)
        sums = _mm256_add_ps(_mm256_loadu_ps(it), sums);

Where each SIMD op is trying to write to a compact data structure.

But in the threaded version https://github.com/unum-cloud/ParallelReductions/blob/fd16d9... they have a separate slot per accumulator, but the slots still live in a shared vector, which most likely has the issue I described.
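
If the shared vector does turn out to matter, one common way around it is to accumulate in a local variable and touch the shared slot exactly once per thread. A minimal sketch, with `threaded_sum` as a made-up helper rather than the linked repository's code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` with `threads` workers, each owning one
// slot in a shared vector of partial sums.
double threaded_sum(const std::vector<float>& data, std::size_t threads) {
    assert(threads > 0);
    // Adjacent slots share cache lines, so writes to them should be rare.
    std::vector<double> partial(threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + threads - 1) / threads;
    for (std::size_t t = 0; t < threads; ++t)
        workers.emplace_back([&, t] {
            const std::size_t first = std::min(t * chunk, data.size());
            const std::size_t last = std::min(first + chunk, data.size());
            double local = 0.0;  // private to this thread
            for (std::size_t i = first; i < last; ++i)
                local += data[i];
            partial[t] = local;  // the only write to the shared vector
        });
    for (auto& w : workers)
        w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

With one store per thread, any false sharing on `partial` is limited to a handful of cache line transfers at the very end, which should be negligible next to the summation itself.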