by milcron
3350 days ago
A great paper that covers different approaches to parallel computing is "Three layer cake for shared-memory programming" [0]. The authors divide parallel programming into three broad strategies:

1. SIMD (parallel lines)
2. Fork-Join (a directed acyclic graph of operations)
3. Message-Passing (a graph of operations)

GPUs are great at SIMD, but bad at the other sorts of parallelism.

[0] https://www.researchgate.net/publication/228683178_Three_lay...
On Nvidia GPUs, 16 to 32 resident warps per SM x 56 SMs on a P100 (the full GP100 die has 60) gives a lot of hardware threads in flight at once, if you count 1 warp as 1 thread. Warps are allowed to branch completely independently of each other. (I forget the maximum occupancy of a P100's SM in warps at lowest resource use.) Furthermore, you can use global memory atomics and spin-locks for event-driven programming, work-stealing, etc. This kind of thing is used in, e.g., persistent kernels. Of course, the single kernel being run must contain the code for all of the tasks. Not easy to write, but possible.
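
To make that concrete, here is a rough sketch of a persistent kernel, not a tuned implementation. The grid is sized to fill the machine once, then each warp acts as one "thread" and pulls task indices from a global atomic counter until the work runs out. Everything named here (`persistent_worker`, `run_persistent_queue`, the two toy task types, the block size of 128) is made up for illustration. It uses a plain atomic counter as the queue and leaves out spin-locks and work-stealing entirely.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical persistent kernel: each warp loops, claiming one task at a time.
// A task is a chunk of warpSize consecutive output elements.
__global__ void persistent_worker(int* next_task, int num_tasks, int n, int* out) {
    const int lane = threadIdx.x % warpSize;  // assumes a 1D block of whole warps
    while (true) {
        int task = 0;
        // Lane 0 claims the next task index from the global counter...
        if (lane == 0) task = atomicAdd(next_task, 1);
        // ...and broadcasts it to the other lanes of the warp.
        task = __shfl_sync(0xffffffffu, task, 0);
        if (task >= num_tasks) break;  // uniform across the warp

        const int i = task * warpSize + lane;
        // The one kernel carries the code for every task type. The branch
        // depends only on the task, so a warp never diverges internally,
        // while different warps take different paths at the same time.
        if (task % 2 == 0) {
            if (i < n) out[i] = 2 * i;   // toy "task type A"
        } else {
            if (i < n) out[i] = i + 1;   // toy "task type B"
        }
    }
}

// Hypothetical host helper: runs the kernel over n elements and returns how
// many came back with the expected value, or -1 on any CUDA error.
int run_persistent_queue(int n) {
    const int block_size = 128;            // 4 warps per block (arbitrary choice)
    const int num_tasks = (n + 31) / 32;   // warp size is 32 on current Nvidia GPUs
    int device = 0, num_sms = 0, blocks_per_sm = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    if (cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device)
            != cudaSuccess) return -1;
    // Ask the runtime how many blocks of this kernel can be resident per SM,
    // so the grid fills the GPU exactly once.
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, persistent_worker, block_size, 0) != cudaSuccess) return -1;
    if (num_sms <= 0 || blocks_per_sm <= 0) return -1;

    int* d_next = nullptr;
    int* d_out = nullptr;
    if (cudaMalloc(&d_next, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) { cudaFree(d_next); return -1; }
    cudaMemset(d_next, 0, sizeof(int));

    persistent_worker<<<num_sms * blocks_per_sm, block_size>>>(d_next, num_tasks, n, d_out);

    std::vector<int> host(n);
    // cudaMemcpy waits for the kernel and also reports any launch or execution error.
    cudaError_t err = cudaMemcpy(host.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_next);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int correct = 0;
    for (int i = 0; i < n; ++i) {
        const int task = i / 32;
        const int expected = (task % 2 == 0) ? 2 * i : i + 1;
        if (host[i] == expected) ++correct;
    }
    return correct;
}
```

Calling `run_persistent_queue(1000)` from a `main` should return 1000 on a working GPU. A real persistent kernel would replace the even/odd branch with a switch over actual task types and would usually let running tasks push new work back onto the queue, which is where the spin-locks and the difficulty come in.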