by gregjm 973 days ago
> My so-called CPU “active” time is actually an inferred value; CUDA spins the CPU 100% constantly, even when the CPU is just waiting for the GPU

The CUDA Runtime and Driver APIs allow you to use “blocking synchronization”, where the CPU goes to sleep while waiting for synchronization with the device. However, it seems that PyTorch doesn’t expose this functionality in any of its Python APIs:

https://github.com/pytorch/pytorch/issues/28224

What happens when you try using ctypes to call into libcudart.so to set the device flags as described in the above issue? You’ll have to call torch.cuda.init() for it to work, and unfortunately it won’t work if PyTorch is launching kernels from other threads.
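
A minimal sketch of that ctypes experiment, for illustration only. It assumes the runtime can be loaded under the name `libcudart.so` (the copy PyTorch actually loaded may have a versioned name) and that `cudaDeviceScheduleBlockingSync` has the value `0x04`, as in the CUDA Runtime headers. The helper name `set_blocking_sync` is hypothetical.

```python
import ctypes

# Value of cudaDeviceScheduleBlockingSync in the CUDA Runtime headers.
CUDA_DEVICE_SCHEDULE_BLOCKING_SYNC = 0x04


def set_blocking_sync(library_name="libcudart.so"):
    """Ask the CUDA Runtime to use blocking synchronization.

    Returns the cudaError_t code from cudaSetDeviceFlags (0 means success),
    or None if the library could not be loaded at all.
    `library_name` is an assumption; adjust it to the runtime in use.
    """
    try:
        cudart = ctypes.CDLL(library_name)
    except OSError:
        return None
    # cudaError_t cudaSetDeviceFlags(unsigned int flags)
    cudart.cudaSetDeviceFlags.argtypes = [ctypes.c_uint]
    cudart.cudaSetDeviceFlags.restype = ctypes.c_int
    return cudart.cudaSetDeviceFlags(CUDA_DEVICE_SCHEDULE_BLOCKING_SYNC)
```

In a PyTorch process you would call `torch.cuda.init()` first, as noted above, and then check that `set_blocking_sync()` returns 0. A non-zero return is a CUDA error code, and the call only affects the calling thread's view of the device.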


Aha, I was hoping to learn about something like this, thanks for sharing. I'll try this some time. PyTorch does use different threads for the forward and backward pass, so as you suggest, setting that flag might only improve the forward pass.
The CUDA Runtime and Driver APIs have per-thread state, so using threads would unfortunately bypass our trick here to set the flag. Assuming you're on Linux, I might suggest creating a shared library to intercept calls to the Driver API, as all Runtime functions are implemented as wrappers around Driver functions. You'd have to intercept all calls to context creation and flag setting:

  * `cuCtxCreate`

  * `cuCtxCreate_v3`

  * `cuCtxSetFlags`

  * `cuDevicePrimaryCtxRetain`

  * `cuDevicePrimaryCtxSetFlags`
... and make sure that the three least significant bits of any `flags` variable are set to `CU_CTX_SCHED_BLOCKING_SYNC`.
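
The flag rewrite itself is just a mask-and-set on the low three bits. The sketch below shows only that arithmetic in Python; a real interposer would be a C shared library that looks up the original functions with `dlsym` and applies the same logic. The constant values are taken from `cuda.h` as I remember them, and the helper name `force_blocking_sync` is hypothetical.

```python
# Scheduling flag values as defined in cuda.h (CUctx_flags).
CU_CTX_SCHED_SPIN = 0x01
CU_CTX_SCHED_YIELD = 0x02
CU_CTX_SCHED_BLOCKING_SYNC = 0x04
CU_CTX_SCHED_MASK = 0x07  # the three least significant bits
CU_CTX_MAP_HOST = 0x08    # example of an unrelated flag that must survive


def force_blocking_sync(flags):
    """Replace the scheduling bits of `flags` with CU_CTX_SCHED_BLOCKING_SYNC.

    All bits outside CU_CTX_SCHED_MASK are passed through unchanged.
    """
    return (flags & ~CU_CTX_SCHED_MASK) | CU_CTX_SCHED_BLOCKING_SYNC
```

For example, a caller asking for `CU_CTX_SCHED_SPIN | CU_CTX_MAP_HOST` would end up with `CU_CTX_SCHED_BLOCKING_SYNC | CU_CTX_MAP_HOST`.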

cuDevicePrimaryCtxSetFlags: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PR...

dlsym(3): https://man.archlinux.org/man/dlsym.3.en

ld.so(8): https://man.archlinux.org/man/ld.so.8.en

I’m somewhat confused as to what is exposed, as the description in the quote sounds like a blocking call, but with a busy wait, which seems like it couldn’t be the only or main thing that PyTorch exposes.
Not just that: you can perfectly happily poll a marker you inserted into the CUDA stream, interspersing the polls with sched_yield() syscalls so that other processes can get work done between your checks of whether the GPU has reached a point where you can retrieve results (if relevant) and submit new work. You would have to tune the scheduler time slice so that those other processes don't keep running so long after you yield that your queue of submitted work runs dry before you get to top it off. This isn't as critical when you can completely fill the scheduler queue (I remember ~1000 entries, but it's been years and I haven't checked whether I even remembered correctly. Don't rely on this!), as you may then want to force a sleep of a millisecond or more to keep the CPU core sleeping instead of merely allowing other processes to get work done.
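
A rough sketch of that poll-and-yield loop, assuming Linux (where `os.sched_yield()` is available). The helper name `poll_with_yield` is hypothetical, and the completion check is passed in as a plain callable so the loop does not depend on any particular GPU library.

```python
import os
import time


def poll_with_yield(is_done, timeout_s=None):
    """Poll `is_done()` until it returns True, yielding the CPU in between.

    Returns the number of polls performed. Raises TimeoutError if
    `timeout_s` elapses before `is_done()` returns True.
    """
    deadline = None if timeout_s is None else time.monotonic() + timeout_s
    polls = 0
    while True:
        polls += 1
        if is_done():
            return polls
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("marker was not reached in time")
        if hasattr(os, "sched_yield"):
            os.sched_yield()  # let other runnable processes use this core
        else:
            time.sleep(0)  # fallback on platforms without sched_yield
```

With PyTorch, the marker could be a `torch.cuda.Event` recorded on the stream, passing its `query` method as `is_done`. Note that this still keeps the core busy whenever nothing else is runnable; it only shares the core, it does not put it to sleep.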
That is indeed the only API that it exposes.