|
|
|
|
|
by jhj
3347 days ago
|
|
You can express some forms of #2 or #3 well on a GPU. It depends upon how wide the graph of tasks is (maximum number of concurrent tasks possible in the graph). On Nvidia GPUs, 16 to 32 warps per SM x 60 SMs on a P100 gives a lot of hardware threads (1 thread == 1 warp) in flight at once; these are allowed to branch completely independent of each other (I forget the maximum occupancy of a P100's SM in warps at lowest resource use). Furthermore, you can use global memory atomics and spin-locks for event driven programming, work-stealing, etc. This kind of stuff is used in, e.g., persistent kernels. Of course, the single kernel that is being run must handle all of the code for all of the tasks. Not easy to write, but possible. |
|