by dragontamer
1816 days ago
This is the first time I've heard of the term "decoupled look-back". But I see that it refers to CUB's implementation of device-wide scan.

I briefly looked at the code, and came across: https://github.com/NVIDIA/cub/blob/main/cub/agent/agent_scan...

I'm seeing lots of calls to "CTA_SYNC()", which ends up being just a "__syncthreads" (a simple thread-barrier). See: https://github.com/NVIDIA/cub/blob/a8910accebe74ce043a13026f...

I admit that I'm looking rather quickly though, but... I'm not exactly seeing where this mysterious "spinlock" is that you're talking about. I haven't tried very hard yet but maybe you can point out what code exactly in this device_scan / decoupled look-back uses a spinlock? Cause I'm just not seeing it.

----------

And of course: a call to cub's "device scan" is innately ordered to kernel-start / kernel-end. So there's your synchronization mechanism right there and then.
|
It doesn't use the word "spin", but repeated polling (step 4 of the algorithm presented in section 4.1, particularly when the flag is X) is basically the same thing as a spinlock.
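
To make the polling concrete, here is a rough CPU-side sketch of the look-back idea using C++ threads and std::atomic. This is not CUB's code. The function name `lookback_inclusive_scan`, the helpers `pack` / `status_of` / `value_of`, the one-thread-per-tile mapping, and the A and P flag names (aggregate available, inclusive prefix available) are my own assumptions for illustration; only the X flag is taken from the discussion above. The `while` loop that keeps re-reading a predecessor's descriptor until its flag stops being X is the part that behaves like a spin-wait.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Descriptor states (names are illustrative):
// X = nothing published yet, A = tile aggregate published,
// P = inclusive prefix published.
enum Status : std::uint32_t { X = 0, A = 1, P = 2 };

// Pack status and value into one 64-bit word so a single atomic
// store publishes both together.
static std::uint64_t pack(Status s, std::uint32_t v) {
    return (static_cast<std::uint64_t>(s) << 32) | v;
}
static Status status_of(std::uint64_t w) {
    return static_cast<Status>(static_cast<std::uint32_t>(w >> 32));
}
static std::uint32_t value_of(std::uint64_t w) {
    return static_cast<std::uint32_t>(w);
}

// Hypothetical helper: inclusive prefix sum with one thread per tile.
std::vector<std::uint32_t> lookback_inclusive_scan(
        const std::vector<std::uint32_t>& in, std::size_t tile) {
    assert(tile > 0);
    const std::size_t n = in.size();
    const std::size_t tiles = (n + tile - 1) / tile;
    std::vector<std::uint32_t> out(n);
    std::vector<std::atomic<std::uint64_t>> desc(tiles);
    for (auto& d : desc) d.store(pack(X, 0), std::memory_order_relaxed);

    auto worker = [&](std::size_t t) {
        const std::size_t begin = t * tile;
        const std::size_t end = std::min(n, begin + tile);

        // Reduce this tile and publish the result. Tile 0 has no
        // predecessors, so its aggregate already is its prefix.
        std::uint32_t aggregate = 0;
        for (std::size_t i = begin; i < end; ++i) aggregate += in[i];
        desc[t].store(pack(t == 0 ? P : A, aggregate),
                      std::memory_order_release);

        // Look back over predecessors, nearest first.
        std::uint32_t exclusive = 0;
        for (std::size_t p = t; p-- > 0;) {
            std::uint64_t w;
            // The spin: poll until the predecessor has published.
            while (status_of(w = desc[p].load(std::memory_order_acquire)) == X) {
                std::this_thread::yield();
            }
            exclusive += value_of(w);
            if (status_of(w) == P) break;  // full prefix found, stop early
        }
        if (t > 0) {
            desc[t].store(pack(P, exclusive + aggregate),
                          std::memory_order_release);
        }

        // Finish the tile-local scan, seeded with the exclusive prefix.
        std::uint32_t running = exclusive;
        for (std::size_t i = begin; i < end; ++i) {
            running += in[i];
            out[i] = running;
        }
    };

    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < tiles; ++t) threads.emplace_back(worker, t);
    for (auto& th : threads) th.join();
    return out;
}
```

Calling `lookback_inclusive_scan(data, 3)` splits `data` into tiles of three elements and returns the running sums. On a CPU the OS scheduler eventually runs every thread, so the polling loop always terminates. A GPU version has to make sure the tile being waited on is actually running, which is a separate concern this sketch does not model.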