|
|
|
|
|
by raphlinus
1816 days ago
|
|
In general spinlocks are a bad idea, but you do see them in contexts like decoupled look-back. As you say, thread granularity is a problem (unless you're on CUDA on Volta+ hardware, which has independent thread scheduling), so you want threadgroup or workgroup granularity. In any case, I'm interested in pushing the boundaries of lock-free algorithms. It is of course easy to reason about kernel-{start/end} synchronization, but the granularity may be too coarse for some interesting applications. |
|
I briefly looked at the code, and came across: https://github.com/NVIDIA/cub/blob/main/cub/agent/agent_scan...
I'm seeing lots of calls to "CTA_SYNC()", which ends up being just a "__syncthreads" (a simple thread-barrier). See: https://github.com/NVIDIA/cub/blob/a8910accebe74ce043a13026f...
I admit that I'm looking rather quickly though, but... I'm not exactly seeing where this mysterious "spinlock" is that you're talking about. I haven't tried very hard yet but maybe you can point out what code exactly in this device_scan / decoupled look-back uses a spinlock? Cause I'm just not seeing it.
----------
And of course: a call to cub's "device scan" is innately ordered to kernel-start / kernel-end. So there's your synchronization mechanism right there and then.