Hacker News new | ask | show | jobs
by jerf 4691 days ago
With a cooperative scheduler, you will, sooner or later, experience some form of starvation. The most obvious is the process that just infinitely loops, but less obvious situations will end up popping up too; calls that you thought were handled by the event loop but turn out to be blocking and add up when you start calling them at scale, strange behavior when you have a set of processes that turn out to yield far less often than you thought and the system starts behaving with much higher latency than it should if you get too many of them, and all kinds of such manifestations. There's also the inability to create things that watch other things; if something does go spinning off into infinity, nothing else gets to run to kill it.

You can hack around many of them, but you eventually hit a wall, and the effort of the hack increases rapidly.

Note I didn't really use gevent in this reply, this is just about cooperative scheduling. There's a reason why we've all but completely abandoned it at the OS level for things we'd call "computers" (as opposed to "embedded systems" or "microcontrollers", etc). I tend to consider cooperative scheduling an enormous red flag in any system that uses it... and yes, that completely and fully includes the currently-popular runtimes that use it.

1 comments

It sounds like this sort of problem comes from using a cooperative scheduler to implement concurrency of arbitrary routines rather than control flow. I haven't been in a situation in which it would even be possible for something to yield less often than I expect, because I expect it to run until it yields. Similarly I don't often find that subroutines return too infrequently because I expect them to run until they return.

This library is probably nice for the places I would otherwise use threads.

You will eventually, at scale, be wrong about that. To have full and correct knowledge of exactly how long your code takes to run sufficient to do this sort of scheduling correctly, by hand, in advance of running it, is basically equivalent to claiming that you never need to profile code because you already know exactly how long it takes. And it is well known and established to my satisfaction that even absolute, total experts in a field will still often be surprised about what actually comes out of a profiler, even in code strictly in their domains. You may well be right most of the time... but that is all you can hope for.
If it takes more than 16.67ms to run a frame's worth of update-and-draw, then it does, and replacing "wake up every in-game entity that asked to wake up this frame" with "let a preemptive scheduler manage ~10,000 threads that want to wake up, do almost nothing, and then sleep for k frames, while some master thread waits on a latch until they're all done" seems unlikely to make it any faster. If the logic my server must perform to handle a request is expensive, then it is, and replacing an event loop with a single-threaded preemptive scheduler will not increase throughput.

I'm not sure why it is difficult to do this sort of thing correctly. The scheduler does next to nothing in the "server with connections managed in coroutines" case and probably makes matters worse in the "storing game state in execution states" case. It could have a positive impact in the server application if one routine is secretly going to crash or run forever, in the sense that the other routines will continue running while the problematic feature is fenced off or fixed.