| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by dragontamer 1816 days ago

GPU-spinlocks are a bad idea, unless the spinlock is applied over the entire Thread-group.

Even then, I'm pretty sure the spinlock is a bad idea, because you probably should be using GPUs as a coprocessor and enforcing "orderings" over CUDA-Streams or OpenCL Task Graphs. The kernel-spawn and kernel-end mechanism provides you your synchronization functionality ("happens-before") when you need it.

---------

From there on out: the GPU-low level synchronization of choice is the thread-barrier (which can extend out beyond a wavefront, but only up to a block).

--------

So that'd be my advice: use a thread-barrier at the lowest level for thread blocks (synchronization between 1024 threads and below). And use kernel-start / kernel-end graphs (aka: CUDA Stream and/or OpenCL Task Graphs) for synchronizing groups of more than 1024 threads together.

Otherwise, I've done some experiments with acquire/release and basic lock/unlock mechanisms. They seem to work as expected. You get deadlocks immediately on older hardware because of the implicit SIMD-execution (so you want only thread#0 or active-thread#0 to perform the lock for the whole wavefront / thread block). You'll still want to use thread-barriers for higher performance synchronization.

Frankly, I'm not exactly sure why you'd want to use a spinlock since thread-barriers are simply higher performance in the GPU world.

1 comments

raphlinus 1816 days ago

In general spinlocks are a bad idea, but you do see them in contexts like decoupled look-back. As you say, thread granularity is a problem (unless you're on CUDA on Volta+ hardware, which has independent thread scheduling), so you want threadgroup or workgroup granularity.

In any case, I'm interested in pushing the boundaries of lock-free algorithms. It is of course easy to reason about kernel-{start/end} synchronization, but the granularity may be too coarse for some interesting applications.

link

dragontamer 1816 days ago

This is the first time I've heard of the term "decoupled look-back". But I see that it refers to CUB's implementation of device-wide scan.

I briefly looked at the code, and came across: https://github.com/NVIDIA/cub/blob/main/cub/agent/agent_scan...

I'm seeing lots of calls to "CTA_SYNC()", which ends up being just a "__syncthreads" (a simple thread-barrier). See: https://github.com/NVIDIA/cub/blob/a8910accebe74ce043a13026f...

I admit that I'm looking rather quickly though, but... I'm not exactly seeing where this mysterious "spinlock" is that you're talking about. I haven't tried very hard yet but maybe you can point out what code exactly in this device_scan / decoupled look-back uses a spinlock? Cause I'm just not seeing it.

----------

And of course: a call to cub's "device scan" is innately ordered to kernel-start / kernel-end. So there's your synchronization mechanism right there and then.

link

raphlinus 1816 days ago

I don't think CUB is doing decoupled look-back, the reference you want is: https://research.nvidia.com/publication/single-pass-parallel...

It doesn't use the word "spin" but repeated polling (step 4 in the algorithm presented in section 4.1, particularly when the flag is X) is basically the same.

link

dragontamer 1816 days ago

3rd paragraph:

> In this report, we describe the decoupled-lookback method of single-pass parallel prefix scan and its implementation within the open-source CUB library of GPU parallel primitives

The CUB-library also states:

https://nvlabs.github.io/cub/structcub_1_1_device_scan.html

>> As of CUB 1.0.1 (2013), CUB's device-wide scan APIs have implemented our "decoupled look-back" algorithm for performing global prefix scan with only a single pass through the input data, as described in our 2016 technical report [1]

Where [1] is a footnote pointing at the exact paper you just linked.

-----------

> It doesn't use the word "spin" but repeated polling (step 4 in the algorithm presented in section 4.1, particularly when the flag is X) is basically the same.

That certainly sounds spinlock-ish. At least that gives me what to look for in the code.

link

raphlinus 1816 days ago

Ah, good then, I didn't know that, thanks for the cite. I haven't studied the CUB code on it carefully.

link