Hacker News new | ask | show | jobs
by raphlinus 1816 days ago
A GPU followup to this article.

While on CPU sequentially consistent semantics are efficient to implement, that seems to be much less true on GPU. Thus, Vulkan completely eliminates sequential consistency and provides only acquire/release semantics[1].

It is extremely difficult to reason about programs using these advanced memory semantics. For example, there is a discussion about whether a spinlock implemented in terms of acquire and release can be reordered in a way to introduce deadlock (see reddit discussion linked from [2]). I was curious enough about this I tried to model it in CDSChecker, but did not get definitive results (the deadlock checker in that tool is enabled for mutexes provided by API, but not for mutexes built out of primitives). I'll also note that using AcqRel semantics is not provided by the Rust version of compare_exchange_weak (perhaps a nit on TFA's assertion that Rust adopts the C++ memory model wholesale), so if acquire to lock the spinlock is not adequate, it's likely it would need to go to SeqCst.

Thus, I find myself quite unsure whether this kind of spinlock would work on Vulkan or would be prone to deadlock. It's also possible it could be fixed by putting a release barrier before the lock loop.

We have some serious experts on HN, so hopefully someone who knows the answer can enlighten us - mixed in of course with all the confidently wrong assertions that inevitably pop up in discussions about memory model semantics.

[1]: https://www.khronos.org/blog/comparing-the-vulkan-spir-v-mem...

[2]: https://rigtorp.se/spinlock/

5 comments

Also: it remains difficult to fully nail down the semantics of sequential consistency as well, especially when it's mixed with other memory semantics. Very likely next time Russ updates his article he should add a reference to Repairing Sequential Consistency in C/C++11[1].

[1]: https://plv.mpi-sws.org/scfix/full.pdf

Thanks for the GPU insights and links (and the paper link below)!

I based my claim about Rust from https://doc.rust-lang.org/nomicon/atomics.html. ("Rust pretty blatantly just inherits the memory model for atomics from C++20.") Perhaps that is out of date?

I believe your claim is correct: https://news.ycombinator.com/item?id=27758461.
There's even more discussion on the lock memory ordering on Stackoverflow: https://stackoverflow.com/questions/61299704/how-c-standard-...

Taking a lock only needs to be an acquire operation and a compiler barrier for other lock operations. Using seq_cst or acq_rel semantics is stronger than needed. From my reading and discussions with people from WG21 the current argument for why taking a lock only requires acq semantics is that a compiler optimization that transforms a non-deadlocking program into a potentially deadlocking program is not allowed. There's an interesting twitter thread where we discuss this I can't find anymore :(.

That is an amazing thread. The fact that C++ apparently allows optimizing

    #include <stdio.h>
    
    int stop = 1;
    
    void maybeStop() {
        if(stop)
            for(;;);
    }
    
    int main() {
        printf("hello, ");
        maybeStop();
        printf("world\n");
    }
into

    int main() {
        printf("hello, world\n");
    }
(as Clang does today) does not inspire confidence about disallowing moving the loop in the other example. If the compiler is allowed to assume that this loop terminates, why not the lock loop?

Maybe there is a reason, but none of this inspires confidence.

The standard says that a thread must eventually terminate, do an atomic operation or do IO. So the while(lock.exchange(true)); loop is different.

Also keep in mind that C++11 specifies std::mutex::lock() to have acquire semantics and unlock() to have release semantics on the lock object. In order for std::mutex to actually work the reordering of m1.unlock(); m2.lock(); to m2.lock(); m1.unlock(); must be disallowed. But since m1 and m2 are separate objects m1.unlock() has no happens before relationship with m2.lock(). This seems to be a problem in the C++11 memory model. The arguments I have heard from some WG21 people is that there is no problem since transforming a wellformed terminating program into a non-terminating program is not allowed. I can't find the wording in the C++ standard that asserts this. But oh well, it works right now on gcc/llvm/msvc.

I don't remember the exact wording, but the standard explicitly makes an exception for the always terminating assumptions, for loops accessing atomic variables or having side effects (i.e. volatile or I/O).
> I'll also note that using AcqRel semantics is not provided by the Rust version of compare_exchange_weak (perhaps a nit on TFA's assertion that Rust adopts the C++ memory model wholesale), so if acquire to lock the spinlock is not adequate, it's likely it would need to go to SeqCst.

Is this true? AcqRel seems to be accepted by the compiler for the success ordering of compare_exchange_weak.

https://doc.rust-lang.org/std/sync/atomic/struct.AtomicU32.h...

It's accepted by the compiler, but if provided, it compiles to a panic.

On the page you linked panics are only mentioned for load and store and the code below seems to work just fine?

    let x = atomic::AtomicU32::new(0);
    x.compare_exchange_weak(
        0,
        1,
        atomic::Ordering::AcqRel,
        atomic::Ordering::Relaxed).unwrap();
    println!("{}", x.load(atomic::Ordering::Relaxed));
Ah, you're right. I was using the same ordering for success and failure. It is possible to use AcqRel in the success case.
Looks like in C++ memory_order_acq_rel is treated like memory_order_acquire when it's a load and memory_order_release when it's a store. I would argue that this isn't really a difference in memory model but a difference in API.
I agree. Sorry for the misdirection - the panic is something I observed when I was experimenting with it, and I misinterpreted it.
GPU-spinlocks are a bad idea, unless the spinlock is applied over the entire Thread-group.

Even then, I'm pretty sure the spinlock is a bad idea, because you probably should be using GPUs as a coprocessor and enforcing "orderings" over CUDA-Streams or OpenCL Task Graphs. The kernel-spawn and kernel-end mechanism provides you your synchronization functionality ("happens-before") when you need it.

---------

From there on out: the GPU-low level synchronization of choice is the thread-barrier (which can extend out beyond a wavefront, but only up to a block).

--------

So that'd be my advice: use a thread-barrier at the lowest level for thread blocks (synchronization between 1024 threads and below). And use kernel-start / kernel-end graphs (aka: CUDA Stream and/or OpenCL Task Graphs) for synchronizing groups of more than 1024 threads together.

Otherwise, I've done some experiments with acquire/release and basic lock/unlock mechanisms. They seem to work as expected. You get deadlocks immediately on older hardware because of the implicit SIMD-execution (so you want only thread#0 or active-thread#0 to perform the lock for the whole wavefront / thread block). You'll still want to use thread-barriers for higher performance synchronization.

Frankly, I'm not exactly sure why you'd want to use a spinlock since thread-barriers are simply higher performance in the GPU world.

In general spinlocks are a bad idea, but you do see them in contexts like decoupled look-back. As you say, thread granularity is a problem (unless you're on CUDA on Volta+ hardware, which has independent thread scheduling), so you want threadgroup or workgroup granularity.

In any case, I'm interested in pushing the boundaries of lock-free algorithms. It is of course easy to reason about kernel-{start/end} synchronization, but the granularity may be too coarse for some interesting applications.

This is the first time I've heard of the term "decoupled look-back". But I see that it refers to CUB's implementation of device-wide scan.

I briefly looked at the code, and came across: https://github.com/NVIDIA/cub/blob/main/cub/agent/agent_scan...

I'm seeing lots of calls to "CTA_SYNC()", which ends up being just a "__syncthreads" (a simple thread-barrier). See: https://github.com/NVIDIA/cub/blob/a8910accebe74ce043a13026f...

I admit that I'm looking rather quickly though, but... I'm not exactly seeing where this mysterious "spinlock" is that you're talking about. I haven't tried very hard yet but maybe you can point out what code exactly in this device_scan / decoupled look-back uses a spinlock? Cause I'm just not seeing it.

----------

And of course: a call to cub's "device scan" is innately ordered to kernel-start / kernel-end. So there's your synchronization mechanism right there and then.

I don't think CUB is doing decoupled look-back, the reference you want is: https://research.nvidia.com/publication/single-pass-parallel...

It doesn't use the word "spin" but repeated polling (step 4 in the algorithm presented in section 4.1, particularly when the flag is X) is basically the same.

3rd paragraph:

> In this report, we describe the decoupled-lookback method of single-pass parallel prefix scan and its implementation within the open-source CUB library of GPU parallel primitives

The CUB-library also states:

https://nvlabs.github.io/cub/structcub_1_1_device_scan.html

>> As of CUB 1.0.1 (2013), CUB's device-wide scan APIs have implemented our "decoupled look-back" algorithm for performing global prefix scan with only a single pass through the input data, as described in our 2016 technical report [1]

Where [1] is a footnote pointing at the exact paper you just linked.

-----------

> It doesn't use the word "spin" but repeated polling (step 4 in the algorithm presented in section 4.1, particularly when the flag is X) is basically the same.

That certainly sounds spinlock-ish. At least that gives me what to look for in the code.

Ah, good then, I didn't know that, thanks for the cite. I haven't studied the CUB code on it carefully.