
by dragontamer 1802 days ago

The one that gives me a headache is thinking about how to oversubscribe a GPU (or worse: 4 GPUs, as in the case of the Summit supercomputer).

It's I/O to send data to and from a GPU, so it's somewhat of an I/O-bound task. But there's also a significant amount of CPU work involved. Ideally, you want to balance CPU work and GPU work to maximize the work being done.

Fortunately, CUDA streams seem like they'd mesh pretty well with coroutines (if enough code were there to support them). But if you're reaching for the "GPU-button", everything is compute-bound (if not, you're "doing it wrong"). So now you have a question of "how much to oversubscribe?"

Then again, that's why you just make the oversubscription factor a #define and then test a lot to find the right factor... EDIT: Or maybe you oversubscribe until the GPU / CPU runs out of VRAM / RAM. Oversubscription isn't really an issue with coroutines that are executed inside of a thread pool: you aren't spending any CPU time needlessly task-switching.
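
A rough sketch of the "make it a constant and sweep it" idea, in Python with asyncio rather than real CUDA streams. Everything here is a stand-in: `NUM_DEVICE_SLOTS`, `PREP_SECONDS`, `DEVICE_SECONDS`, `run_batch` and `sweep` are hypothetical names, and both phases are simulated with `asyncio.sleep`, so the timings only illustrate how host-side staging can overlap with device work. They say nothing about real hardware.

```python
import asyncio
import time

OVERSUBSCRIPTION_FACTOR = 4   # plays the role of the #define; hypothetical value
NUM_DEVICE_SLOTS = 2          # hypothetical: jobs the "device" can run at once
PREP_SECONDS = 0.02           # simulated host-side staging per job
DEVICE_SECONDS = 0.02         # simulated device work per job


async def run_batch(num_jobs, factor, num_slots=NUM_DEVICE_SLOTS):
    """Run num_jobs simulated jobs, allowing factor * num_slots in flight."""
    in_flight = asyncio.Semaphore(factor * num_slots)
    device = asyncio.Semaphore(num_slots)

    async def job(job_id):
        async with in_flight:
            # Host-side phase: overlaps with other jobs' device phase
            # only if more jobs are in flight than there are device slots.
            await asyncio.sleep(PREP_SECONDS)
            async with device:
                # Device phase: at most num_slots jobs are here at a time.
                await asyncio.sleep(DEVICE_SECONDS)
        return job_id

    start = time.perf_counter()
    results = await asyncio.gather(*(job(i) for i in range(num_jobs)))
    return results, time.perf_counter() - start


def sweep(factors, num_jobs=8):
    """Return {factor: elapsed_seconds} for each candidate factor."""
    return {f: asyncio.run(run_batch(num_jobs, f))[1] for f in factors}


if __name__ == "__main__":
    for factor, elapsed in sweep([1, 2, OVERSUBSCRIPTION_FACTOR]).items():
        print(f"factor={factor}: {elapsed:.3f}s")
```

With a factor of 1, each job's staging and device phases run back to back; with a larger factor, staging for later jobs overlaps with device work for earlier ones. In a real setup the sweep would also have to watch VRAM / RAM usage, which this sketch does not model.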

1 comment

cogman10 1802 days ago

And, TBF, a lot of the IO stuff comes down to specifically talking about what sort of device you are talking to and where.

For a lot of the programming I do (and I'm sure a lot of others on HN) IO is almost all network IO. For that, because it's so slow and everything is working over DMA anyways, coroutines end up working really well.

However, once you start talking about on-system resources such as SSDs or the GPU, it gets more tricky. As you rightly point out, the GPU is especially bad because all GPU communication ends up being routed through the CPU. At least for an HD, there's DMA, which cuts down on the amount of CPU work that needs to be done to access a bit of data.