by ashvardanian 1592 days ago
Here is the source and the threads: https://github.com/unum-cloud/ParallelReductions/blob/fd16d9...

OFC we don't expect the compiler to instantiate them for us, it's not OpenMP :) That one we covered in previous articles. OpenMP gave us about 50 GB/s with all cores enabled and 80 GB/s with part of them disabled.


Is there an advantage to using taskflow for parallel for, if you already have another threadpool implementation? I recently removed taskflow in a project that was only being used for a parallel for loop (as part of a larger refactor, the code had a number of issues...), and I'm wondering if that was a mistake now that I see that pattern somewhere else. :)
Nope, don't worry :) I did it out of laziness. I didn’t want to implement a task queue for std::thread-s, so I took TaskFlow, as one of the most famous solutions. You can definitely get better async task management with enough C++ experience and time.
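
For a one-off parallel loop you may not even need a queue. Below is a rough sketch with plain std::thread, not the code from the linked repository: each thread gets one contiguous slice and writes a single partial sum. The name `parallel_sum` and its parameters are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` by handing each thread one contiguous slice.
double parallel_sum(std::vector<float> const &data, std::size_t thread_count) {
    assert(thread_count > 0);
    std::vector<double> partials(thread_count, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(thread_count);

    // Ceiling division, so the slices together cover every element.
    std::size_t const slice = (data.size() + thread_count - 1) / thread_count;

    for (std::size_t t = 0; t < thread_count; ++t) {
        workers.emplace_back([&, t] {
            // Clamp both ends: trailing threads may get an empty slice.
            std::size_t const begin = std::min(t * slice, data.size());
            std::size_t const end = std::min(begin + slice, data.size());
            partials[t] = std::accumulate(data.data() + begin, data.data() + end, 0.0);
        });
    }
    for (auto &worker : workers) worker.join();

    // Combine the per-thread partial sums on the calling thread.
    return std::accumulate(partials.begin(), partials.end(), 0.0);
}
```

Spawning threads on every call is wasteful if the loop runs often, which is where a reusable pool or a library like TaskFlow starts to pay off.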
My C++ is not great (so it is hard for me to tell what is going on), and I'm used to OpenMP, where my understanding has always been that you tend to get a single thread per processor (or per hyper-thread). Not sure if that is guaranteed with the way your code is laid out? Perhaps it really is a NUMA issue, as others suggest. I will note that one other variation I had (as it looks like you are already splitting across threads) is that the chunks were actually smaller than an even per-thread split, so there were more chunks than threads, which meant a faster thread would take more chunks rather than waiting on the slowest thread. Good luck!
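
In case it helps, here is roughly what I mean by the small-chunk variation, as a sketch rather than anything taken from the linked code. Threads pull the next chunk from a shared atomic offset, so a faster thread simply ends up processing more chunks. The name `dynamic_chunk_sum` and the chunk size are assumptions for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: threads claim fixed-size chunks from a shared counter.
double dynamic_chunk_sum(std::vector<float> const &data,
                         std::size_t thread_count, std::size_t chunk_size) {
    assert(thread_count > 0 && chunk_size > 0);
    std::atomic<std::size_t> next_offset{0};
    std::vector<double> partials(thread_count, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(thread_count);

    for (std::size_t t = 0; t < thread_count; ++t) {
        workers.emplace_back([&, t] {
            double local = 0.0;
            for (;;) {
                // Claim the next unprocessed chunk; relaxed ordering is enough
                // because the input is read-only and join() publishes results.
                std::size_t const begin =
                    next_offset.fetch_add(chunk_size, std::memory_order_relaxed);
                if (begin >= data.size()) break;
                std::size_t const end = std::min(begin + chunk_size, data.size());
                local = std::accumulate(data.data() + begin, data.data() + end, local);
            }
            partials[t] = local; // one write per thread, after all its chunks
        });
    }
    for (auto &worker : workers) worker.join();
    return std::accumulate(partials.begin(), partials.end(), 0.0);
}
```

This is close in spirit to OpenMP's `schedule(dynamic, chunk)`. Smaller chunks balance the load better but put more traffic on the shared counter, so the chunk size is worth tuning.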