|
|
|
|
|
by nwallin
2139 days ago
|
|
Barrier synchronization is always going to wreck parallel performance. When possible, it's good to break up your work so that data dependencies are explicit and fine grained. So instead of breaking up work like this: IIIIJJJJKKKKLLLL
<barrier>
EEEEFFFFGGGGHHHH
<barrier>
AAAABBBBCCCCDDDD
... where each row is dependent on the entire previous row, break it up like this: HHHHIIIIJJJJKKKK
EEEEEEFFFFGGGGGG
AAAABBBBCCCCDDDD
Where E is dependent on A and B, (and begins when they are done) H is dependent only on E, F is dependent on B and C, etc. This way you don't have to wait for the entire strip to finish, you can start work on E after A+B, but while D is still working. This also gives you weird block sizes for "free", which is a huge no-no with barrier synchronization. For instance, Es and Gs are size 6; if the first example had most elements of size 4, but one element of size 6, it would be catastrophic. All your threads would be done when the final thread is only 66% complete, and it's single threaded from there on out for that strip. |
|
[1] https://en.wikipedia.org/wiki/Prefix_sum#:~:text=A%20work%2D....