Hacker News
by cogman10 1800 days ago
I agree.

Coroutines MIGHT be more efficient if what you end up building is a state machine anyway (that's what the compiler turns most of those coroutines into). Otherwise, if it's just pure parallel CPU/memory burning with few state transitions or dependencies, then a dedicated thread pool fixed to roughly the number of CPU cores on the box will be the most efficient.
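
To make the state-machine point concrete, here is a rough Python sketch (Python generators are stackless coroutines). The names `running_total`, `RunningTotalMachine`, and `drive_both` are made up for illustration, and the class is only a hand-written approximation of the kind of thing a compiler might generate, not any particular compiler's output.

```python
def running_total():
    """Coroutine form: the suspension point lives in the generator frame."""
    total = 0
    while True:
        value = yield total   # suspend here, resume when a value is sent
        if value is None:
            return
        total += value


class RunningTotalMachine:
    """Hand-written state machine doing the same job (illustrative only)."""
    START, WAITING, DONE = range(3)

    def __init__(self):
        self.state = self.START
        self.total = 0

    def step(self, value=None):
        if self.state == self.START:
            # Equivalent of priming the generator: run up to the first yield.
            self.state = self.WAITING
            return self.total
        if self.state == self.WAITING:
            if value is None:
                self.state = self.DONE
                raise StopIteration
            self.total += value
            return self.total
        raise StopIteration


def drive_both(values):
    """Feed the same inputs to both forms and collect their outputs."""
    gen = running_total()
    machine = RunningTotalMachine()
    gen_out = [next(gen)]
    machine_out = [machine.step()]
    for v in values:
        gen_out.append(gen.send(v))
        machine_out.append(machine.step(v))
    return gen_out, machine_out
```

Both forms produce the same sequence of outputs for the same inputs; the generator just keeps its "state" in the suspended frame instead of an explicit field.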

Heck, it can often even yield benefits to "pin" certain tasks to a thread to keep the CPU cache filled with relevant data. For example, 4 threads each handling one of the 4 quadrants of a matrix, rather than having the next available thread pick up the next task.
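
A minimal sketch of the quadrant idea, assuming a square matrix stored as a list of lists. The function names are hypothetical. Note that CPython's GIL means this shows only the ownership structure (each thread touches one fixed quadrant, no shared work queue), not a real speedup; in a native language you would likely combine it with OS-level thread affinity.

```python
import threading


def quadrant_bounds(n):
    """Split an n x n matrix into four (row_range, col_range) quadrants."""
    half = n // 2
    return [
        (range(0, half), range(0, half)),
        (range(0, half), range(half, n)),
        (range(half, n), range(0, half)),
        (range(half, n), range(half, n)),
    ]


def sum_quadrant(matrix, rows, cols, out, slot):
    """Worker body: reads only its own quadrant, writes only its own slot."""
    total = 0
    for r in rows:
        row = matrix[r]
        for c in cols:
            total += row[c]
    out[slot] = total


def pinned_quadrant_sums(matrix):
    """One dedicated thread per quadrant instead of a shared task queue."""
    n = len(matrix)
    out = [0] * 4
    threads = [
        threading.Thread(target=sum_quadrant, args=(matrix, rows, cols, out, i))
        for i, (rows, cols) in enumerate(quadrant_bounds(n))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```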


The one that gives me a headache is thinking about how to oversubscribe a GPU (or worse: 4 GPUs, as in the case of the Summit supercomputer).

It's I/O to send data to and from a GPU, so it's somewhat of an I/O-bound task. But there's also a significant amount of CPU work involved. Ideally, you want to balance CPU work and GPU work to maximize the work being done.

Fortunately, CUDA streams seem like they'd mesh pretty well with coroutines (if enough code were there to support them). But if you're reaching for the "GPU-button", everything is compute-bound (if not, you're "doing it wrong"). So now you have a question of "how much to oversubscribe?"

Then again, that's why you just make the oversubscription factor a #define and then test a lot to find the right factor. EDIT: Or maybe you oversubscribe until the GPU / CPU runs out of VRAM / RAM. Oversubscription isn't really an issue with coroutines that are executed inside of a thread pool: you aren't spending any CPU time needlessly task-switching.
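
As a rough illustration of "make the factor a constant and tune it", here is an asyncio sketch. It does not touch a real GPU: `asyncio.sleep` stands in for an asynchronous transfer plus kernel launch, and `OVERSUBSCRIPTION_FACTOR`, `NUM_DEVICES`, `run_job`, and `run_all` are all hypothetical names chosen for the example.

```python
import asyncio

OVERSUBSCRIPTION_FACTOR = 4   # hypothetical tunable; find the real value by measuring
NUM_DEVICES = 2               # hypothetical device count


async def run_job(job_id, slots, stats):
    """Hold one slot while the (simulated) device work is in flight."""
    async with slots:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.001)   # stand-in for async transfer + kernel
        stats["in_flight"] -= 1
        return job_id * 2


async def run_all(num_jobs, factor=OVERSUBSCRIPTION_FACTOR, devices=NUM_DEVICES):
    """Run every job, allowing at most factor * devices in flight at once."""
    slots = asyncio.Semaphore(factor * devices)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(run_job(i, slots, stats) for i in range(num_jobs))
    )
    return results, stats["peak"]
```

Calling `asyncio.run(run_all(50))` returns the results plus the peak number of jobs that were in flight together, which is the number you would watch while sweeping the factor.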

And, TBF, a lot of the IO stuff comes down to exactly what sort of device you are talking to and where it sits.

For a lot of the programming I do (and I'm sure a lot of others on HN) IO is almost all network IO. For that, because it's so slow and everything is working over DMA anyways, coroutines end up working really well.

However, once you start talking about on-system resources such as SSDs or the GPU, it gets trickier. As you rightly point out, the GPU is especially bad because all GPU communication ends up being routed through the CPU. At least for a hard drive, there's DMA, which cuts down on the amount of CPU work that needs to be done to access a bit of data.

Only stackless co-routines require state machine transformation. Stackful co-routine based user mode threading generally just changes out the IO primitives to issue an asynchronous version of the operation and immediately call into the user mode scheduler, which picks some ready-to-resume co-routine, switches the stack to it, and resumes it. Such systems might include a preemption facility (beyond just the OS's preemption of the underlying kernel threads), but that is not required and is largely a language/runtime design decision.

The big headaches with stackful co-routine based user mode threading come from two sources. One is allocating the stack. If your language requires a contiguous stack then you either need to make the stacks small, and risk running out, or make them big, which can be a problem on 32-bit platforms (you can run out of address space) or on platforms with strict commit-charge based memory accounting. Both can be mitigated by allowing non-contiguous stacks or re-locatable contiguous stacks (to allow small stacks to grow later without headaches), although obviously that can have performance costs.

The other stackful co-routine headache is calling into code from another language (i.e. via FFI), which could be making direct blocking system calls and end up starving you of your OS threads.
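
The starvation effect can be sketched with Python's asyncio. asyncio is stackless rather than stackful, but the mechanism is the same: a blocking call holds the OS thread the user mode scheduler runs on. Here `blocking_foreign_call` is a hypothetical stand-in for an FFI function (it just calls `time.sleep`), and pushing it onto a separate thread pool is one common mitigation, shown as an assumption rather than something prescribed above.

```python
import asyncio
import time


def blocking_foreign_call(seconds):
    """Stand-in for an FFI function that makes a blocking system call."""
    time.sleep(seconds)
    return "done"


async def heartbeat(counter):
    """Other user mode work that should keep making progress."""
    while True:
        await asyncio.sleep(0.01)
        counter["ticks"] += 1


async def measure(offload):
    """Count heartbeat ticks that happen while the foreign call runs."""
    counter = {"ticks": 0}
    task = asyncio.create_task(heartbeat(counter))
    await asyncio.sleep(0)           # let the heartbeat start
    if offload:
        loop = asyncio.get_running_loop()
        # Run the blocking call on a worker thread; the scheduler stays free.
        await loop.run_in_executor(None, blocking_foreign_call, 0.2)
    else:
        blocking_foreign_call(0.2)   # blocks the only scheduler thread
    ticks = counter["ticks"]
    task.cancel()
    return ticks
```

With `offload=False` the heartbeat makes no progress during the call; with `offload=True` it keeps ticking.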

I do agree that in purely CPU or memory bound applications a classical thread pool makes better sense. The advantages of either type of co-routine based user mode threading apply primarily to IO-heavy or mixed workloads.