Hacker News new | ask | show | jobs
by nwallin 2139 days ago
Barrier synchronization is always going to wreck parallel performance. When possible, it's good to break up your work so that data dependencies are explicit and fine grained. So instead of breaking up work like this:

    IIIIJJJJKKKKLLLL

    <barrier>

    EEEEFFFFGGGGHHHH

    <barrier>

    AAAABBBBCCCCDDDD
... where each row is dependent on the entire previous row, break it up like this:

    HHHHIIIIJJJJKKKK

    EEEEEEFFFFGGGGGG

    AAAABBBBCCCCDDDD
Where E is dependent on A and B, (and begins when they are done) H is dependent only on E, F is dependent on B and C, etc. This way you don't have to wait for the entire strip to finish, you can start work on E after A+B, but while D is still working. This also gives you weird block sizes for "free", which is a huge no-no with barrier synchronization. For instance, Es and Gs are size 6; if the first example had most elements of size 4, but one element of size 6, it would be catastrophic. All your threads would be done when the final thread is only 66% complete, and it's single threaded from there on out for that strip.
1 comments

This pencil and paper approach to breaking up work is good at parallelizing things that don't appear parallelizable at first glance, e.g. prefix sum [1]. Compilers are good at identifying opportunities to reorder instructions in ways that break low level data dependencies. They tend to miss the big higher level opportunities like this.

[1] https://en.wikipedia.org/wiki/Prefix_sum#:~:text=A%20work%2D....