by cogman10
1800 days ago
I agree. Coroutines MIGHT be more efficient if what you end up building is a state machine anyway (since that's what the compiler turns most coroutines into). Otherwise, if it's just pure parallel CPU/memory burning with few state transitions or dependencies, then a dedicated thread pool sized to roughly the number of CPU cores on the box will be the most efficient. Heck, it can often even yield benefits to "pin" certain tasks to a thread to keep the CPU cache filled with relevant data. For example, 4 threads each handling one of the 4 quadrants of a matrix, rather than having the next available thread pick up the next task.
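A rough sketch of the quadrant example, assuming a square row-major matrix with an even side length. The helper name `sum_quadrants` is made up for illustration, and the "pinning" here is only static assignment of quadrant `q` to worker thread `q`. Binding a thread to a physical core needs OS-specific affinity calls, which this leaves out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: sums each quadrant of a row-major n x n matrix.
// Quadrant q is always handled by worker q, so each thread only ever
// touches its own region of the matrix.
std::array<double, 4> sum_quadrants(const std::vector<double>& m, std::size_t n) {
    assert(n % 2 == 0 && m.size() == n * n);  // assumption: even, square
    const std::size_t h = n / 2;
    std::array<double, 4> sums{};  // slot q is written only by worker q
    std::vector<std::thread> workers;
    for (std::size_t q = 0; q < 4; ++q) {
        workers.emplace_back([&m, &sums, n, h, q] {
            const std::size_t r0 = (q / 2) * h;  // first row of quadrant q
            const std::size_t c0 = (q % 2) * h;  // first column of quadrant q
            double acc = 0.0;                    // accumulate locally, store once
            for (std::size_t r = r0; r < r0 + h; ++r)
                for (std::size_t c = c0; c < c0 + h; ++c)
                    acc += m[r * n + c];
            sums[q] = acc;
        });
    }
    for (auto& w : workers) w.join();
    return sums;
}
```

In a long-running program you would keep the four workers alive and hand each one its quadrant repeatedly. Spawning threads per call, as done here, just keeps the sketch short.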
It's I/O to send data to and from a GPU, so it's somewhat of an I/O-bound task. But there's also a significant amount of CPU work involved. Ideally, you want to balance CPU work and GPU work to maximize the work being done.
Fortunately, CUDA streams seem like they'd mesh pretty well with coroutines (if enough code were there to support them). But if you're reaching for the "GPU-button", everything is compute-bound (if not, you're "doing it wrong"). So now you have a question of "how much to oversubscribe?"
Then again, that's why you just make the oversubscription factor a #define and then test a lot to find the right value... EDIT: Or maybe you oversubscribe until the GPU / CPU runs out of VRAM / RAM. Oversubscription isn't really an issue with coroutines that are executed inside of a thread pool: you aren't spending any CPU time needlessly task-switching.
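Something like this is what I mean by making the factor a #define, as a minimal sketch. The macro name `OVERSUBSCRIPTION_FACTOR`, its default of 4, and the helper names are all assumptions for illustration; the right value only comes out of benchmarking on the target box.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// Assumed starting value; override at build time with
// -DOVERSUBSCRIPTION_FACTOR=<n> and measure.
#ifndef OVERSUBSCRIPTION_FACTOR
#define OVERSUBSCRIPTION_FACTOR 4
#endif

// Hypothetical helper: how many tasks (coroutines) to keep in flight
// for a given core count. A core count of 0 is treated as 1.
constexpr std::size_t in_flight_tasks(std::size_t cores) {
    return std::max<std::size_t>(cores, 1) * OVERSUBSCRIPTION_FACTOR;
}

// hardware_concurrency() may return 0 when the core count is unknown,
// which in_flight_tasks() already guards against.
inline std::size_t default_in_flight_tasks() {
    return in_flight_tasks(std::thread::hardware_concurrency());
}
```

The thread pool itself stays at roughly one thread per core; only the number of queued tasks scales with the factor, which is why the extra tasks cost memory rather than context switches.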