Hacker News new | ask | show | jobs
by raphlinus 1816 days ago
In general spinlocks are a bad idea, but you do see them in contexts like decoupled look-back. As you say, thread granularity is a problem (unless you're on CUDA on Volta+ hardware, which has independent thread scheduling), so you want threadgroup or workgroup granularity.

In any case, I'm interested in pushing the boundaries of lock-free algorithms. It is of course easy to reason about kernel-{start/end} synchronization, but the granularity may be too coarse for some interesting applications.

1 comments

This is the first time I've heard of the term "decoupled look-back". But I see that it refers to CUB's implementation of device-wide scan.

I briefly looked at the code, and came across: https://github.com/NVIDIA/cub/blob/main/cub/agent/agent_scan...

I'm seeing lots of calls to "CTA_SYNC()", which ends up being just a "__syncthreads" (a simple thread-barrier). See: https://github.com/NVIDIA/cub/blob/a8910accebe74ce043a13026f...

I admit that I'm looking rather quickly though, but... I'm not exactly seeing where this mysterious "spinlock" is that you're talking about. I haven't tried very hard yet but maybe you can point out what code exactly in this device_scan / decoupled look-back uses a spinlock? Cause I'm just not seeing it.

----------

And of course: a call to cub's "device scan" is innately ordered to kernel-start / kernel-end. So there's your synchronization mechanism right there and then.

I don't think CUB is doing decoupled look-back, the reference you want is: https://research.nvidia.com/publication/single-pass-parallel...

It doesn't use the word "spin" but repeated polling (step 4 in the algorithm presented in section 4.1, particularly when the flag is X) is basically the same.

3rd paragraph:

> In this report, we describe the decoupled-lookback method of single-pass parallel prefix scan and its implementation within the open-source CUB library of GPU parallel primitives

The CUB-library also states:

https://nvlabs.github.io/cub/structcub_1_1_device_scan.html

>> As of CUB 1.0.1 (2013), CUB's device-wide scan APIs have implemented our "decoupled look-back" algorithm for performing global prefix scan with only a single pass through the input data, as described in our 2016 technical report [1]

Where [1] is a footnote pointing at the exact paper you just linked.

-----------

> It doesn't use the word "spin" but repeated polling (step 4 in the algorithm presented in section 4.1, particularly when the flag is X) is basically the same.

That certainly sounds spinlock-ish. At least that gives me what to look for in the code.

Ah, good then, I didn't know that, thanks for the cite. I haven't studied the CUB code on it carefully.