Hacker News new | ask | show | jobs
by jhokanson 1591 days ago
It is not exactly clear to me what is going on with threads (I guess you are using all of them?). I haven't done too much in this space but anecdotally I've had better luck if my summation is explicitly split into sub-summation tasks. It is not clear if that is being done here. It looks like a single summation loop that the author is expecting the computer to magically split across multiple threads. I'd be interested in seeing what this looks like if instead the task were to add chunks of the original dataset into results per thread (e.g, first 8000 samples on first thread, next 8000 on 2nd thread, etc.), with a final accumulation loop across all threads. Again, the author may be trying this and this is not my area of expertise but I've had decent luck saturating the memory bus with a similar approach.
2 comments

Here is the source and the threads: https://github.com/unum-cloud/ParallelReductions/blob/fd16d9...

OFC we don't expect the compiler to instantiate them for us, it's not OpenMP :) That one we covered in previous articles. OpenMP gave us about 50 GB/s with all cores enabled and 80 GB/s with part of them disabled.

Is there an advantage to using taskflow for parallel for, if you already have another threadpool implementation? I recently removed taskflow in a project that was only being used for a parallel for loop (as part of a larger refactor, the code had a number of issues...), and I'm wondering if that was a mistake now that I see that pattern somewhere else. :)
Nope, dont worry :) I did it our of laziness. I didn’t want to implement a task queue for std::thread-s, so I took TaskFlow, as one of the most famous solutions. You can definitely get better async task management with enough C++ experience and time.
My c++ is not great (so it is hard for me to tell what is going on) and I'm used to OpenMP where my understanding has always been that you tend to get a single thread per processor (or per hyper-thread) -- not sure if that is guaranteed with the way your code is laid out? Perhaps it really is a NUMA issue as others suggest. I will note that one other variation I had (as it looks like you are already splitting across threads) is that the chunk sizes were actually smaller than the # of threads which meant a faster thread would take more chunks rather than waiting on the slowest thread. Good luck!
Doug Lea of Java Memory Model and concurrency note went pretty far down this rabbit hole. Not only do you use separate counters/queues per thread/core, but you also put empty space around them so that you don't accidentally share cache lines. I don't know what they do now, but at the time some of the data structures in that library used arrays where only every 8th or 16th entry is used to avoid two cores trying to read from the same cache line.

Typically allocating a separate data structure per actor also accomplishes this as a happy accident. If the thread does the allocation, then it has a better chance of being in the right bank as well.

Yes, thats needed when you have counters in global memory. In that case, instead of just having vector<double> you would put each double into a stricture aligned to 64 byte addresses. Here all the counters are on local stack, so that trick unfortunately wont help
For the single threaded version, I believe they have a similar problem with

    auto sums = _mm256_set1_ps(0);
    for (; it + 8 < end; it += 8)
        sums = _mm256_add_ps(_mm256_loadu_ps(it), sums);
Where each SMD op is trying to overwrite to a compact data structure.

But in the threaded version https://github.com/unum-cloud/ParallelReductions/blob/fd16d9... they have separate slots for an accumulator but it's still in a shared vector, which most likely has the issue I described.