by storystarling
141 days ago
It is less about the raw transfer speed and more about the synchronization and kernel launch overheads. If you profile a standard inference loop with a batch size of 1 you see the GPU spending a lot of time idle waiting for the CPU to dispatch the next command. That is why optimizations like CUDA graphs exist, but moving the control flow entirely to the device is the cleaner solution.

Dispatch has overhead, but it's largely insignificant. In the cases where it would otherwise be significant:
1. Fused kernels exist
2. CUDA graphs (and other forms of work-submission pipelining) exist
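
To put rough numbers on both positions, here is a toy timing model, not a measurement. It assumes asynchronous launches, a single stream, and a fixed CPU cost per launch; the function names and every microsecond value you feed it are hypothetical placeholders, and the fusion case ignores any memory-traffic savings. It only shows when the GPU ends up waiting on the CPU and how fusing kernels or replaying a graph changes that.

```python
# Toy model of CPU dispatch vs. GPU execution on one stream.
# All names and timings are illustrative assumptions, not profiler output.

def step_time_us(num_kernels, kernel_us, launch_us):
    """Wall time for one step when the CPU launches kernels one by one.

    A kernel starts once the CPU has issued it AND the previous kernel is done.
    """
    cpu_t = 0.0      # time at which the CPU finishes issuing the current launch
    gpu_free = 0.0   # time at which the GPU finishes the previous kernel
    for _ in range(num_kernels):
        cpu_t += launch_us
        start = max(cpu_t, gpu_free)
        gpu_free = start + kernel_us
    return gpu_free


def fused_step_time_us(num_kernels, kernel_us, launch_us, fuse_factor):
    """Same work, but every `fuse_factor` kernels are merged into one launch."""
    return step_time_us(num_kernels // fuse_factor, kernel_us * fuse_factor, launch_us)


def graph_step_time_us(num_kernels, kernel_us, graph_launch_us):
    """One launch for the whole captured graph, then kernels run back to back."""
    return graph_launch_us + num_kernels * kernel_us


def gpu_busy_fraction(num_kernels, kernel_us, total_us):
    """Share of the step during which the GPU is actually executing kernels."""
    return num_kernels * kernel_us / total_us


if __name__ == "__main__":
    n, launch = 100, 5.0
    for kernel in (2.0, 20.0):
        total = step_time_us(n, kernel, launch)
        print(f"kernel={kernel}us eager: {total:.0f}us, "
              f"busy={gpu_busy_fraction(n, kernel, total):.2f}")
    print("fused x10:", fused_step_time_us(n, 2.0, launch, 10), "us")
    print("graph    :", graph_step_time_us(n, 2.0, 10.0), "us")
```

In this model the step is dispatch-bound only when the per-kernel GPU time is shorter than the per-launch CPU cost, which is the batch-size-1 situation described above. Once kernels are fused or submitted as a graph, the launch cost is paid far fewer times per step and the GPU stays mostly busy.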