| GPU-spinlocks are a bad idea, unless the spinlock is applied over the entire Thread-group. Even then, I'm pretty sure the spinlock is a bad idea, because you probably should be using GPUs as a coprocessor and enforcing "orderings" over CUDA-Streams or OpenCL Task Graphs. The kernel-spawn and kernel-end mechanism provides you your synchronization functionality ("happens-before") when you need it. --------- From there on out: the GPU-low level synchronization of choice is the thread-barrier (which can extend out beyond a wavefront, but only up to a block). -------- So that'd be my advice: use a thread-barrier at the lowest level for thread blocks (synchronization between 1024 threads and below). And use kernel-start / kernel-end graphs (aka: CUDA Stream and/or OpenCL Task Graphs) for synchronizing groups of more than 1024 threads together. Otherwise, I've done some experiments with acquire/release and basic lock/unlock mechanisms. They seem to work as expected. You get deadlocks immediately on older hardware because of the implicit SIMD-execution (so you want only thread#0 or active-thread#0 to perform the lock for the whole wavefront / thread block). You'll still want to use thread-barriers for higher performance synchronization. Frankly, I'm not exactly sure why you'd want to use a spinlock since thread-barriers are simply higher performance in the GPU world. |
In any case, I'm interested in pushing the boundaries of lock-free algorithms. It is of course easy to reason about kernel-{start/end} synchronization, but the granularity may be too coarse for some interesting applications.