

by zelphirkalt 815 days ago

> However, what are the advantages of rescheduling an already running coroutine on a different machine?

Your original machine might not have the required computational resources to run the job quickly enough.

> Shouldn't it be the job of the load balancer, to choose the least busy machine, before any coroutine is run?

It is not always simple, or even possible, to know when each part of a computation will finish, so the work cannot easily be planned perfectly up front. If you mean a load balancer such as traefik, then that entails serializing your intermediate results or writing your code in a specific way, so that the computation can be split using a load balancer.

> Isn't it expensive to serialize coroutines and transfer them between machines?

Probably, but not taking advantage of idle cores on another machine might be more expensive (in terms of time needed to finish the computation).

> Also, I wonder what observability is like. If a coroutine crashes, what its stacktrace will look like?

No idea, have not used it.

This kind of thing is what Erlang excels at. Serializing functions and their entire environments is a difficult problem to solve. I think in Python it probably means moving into a whole different space of types and objects, because Python was not designed with such a thing in mind from the start, while Erlang was.
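To illustrate why this is hard in stock Python, here is a minimal sketch, assuming CPython and only the standard `pickle` module (not whatever mechanism the article uses). A suspended generator is refused by `pickle`, while the same logic written with its state in a plain dict serializes without trouble. The names `countdown` and `step` are made up for the example.

```python
import pickle


def countdown(n):
    """A generator: its local state (n) lives inside a suspended frame."""
    while n > 0:
        yield n
        n -= 1


def step(state):
    """Explicit-state version: returns (value, next_state), or (None, None) when done."""
    n = state["n"]
    if n <= 0:
        return None, None
    return n, {"n": n - 1}


# Advance the generator once so there is mid-flight state worth capturing.
gen = countdown(3)
next(gen)
try:
    pickle.dumps(gen)
    generator_pickled = True
except TypeError:
    # CPython's standard pickler has no support for generator objects.
    generator_pickled = False

# The explicit-state version survives a round trip through bytes, which
# stands in here for "send it to another machine and resume there".
value, state = step({"n": 3})
restored = pickle.loads(pickle.dumps(state))
remaining = []
while True:
    value, restored = step(restored)
    if restored is None:
        break
    remaining.append(value)
```

After this runs, `generator_pickled` is `False` and `remaining` holds the values the computation still had left to produce. The price of the second form is that the control flow has to be rewritten by hand as a state machine.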

2 comments

kgeist 814 days ago

> Probably, but not taking advantage of idle cores on another machine might be more expensive (in terms of time needed to finish the computation).

If there are idle cores on another machine, why not just point the load balancer to schedule new jobs/coroutines on those idle machines? Why reschedule existing coroutines on another machine if they run just fine on the current node (i.e. the node is at full capacity, and that's fine)? I can see that a problem can arise if a coroutine wants to spawn new coroutines: if we schedule them on the same node, which is at full capacity, they can end up waiting for a while before they run. It makes sense to schedule new jobs on different machines, but why reschedule existing coroutines?

You have to serialize/deserialize stuff and send it over the wire, and there are several gotchas as explained in the article (pickling arbitrary objects doesn't sound very reliable/safe and, judging from the article, can break in future versions of Python), plus there's a lot of magic involved which will probably make it harder to investigate things in production.

I'd personally just stick with a local coroutine scheduler plus a global load balancer which picks nodes for new jobs. Coroutines, when created, would only receive plain values as arguments (string, int, float and basic structures), so that they are reliably serializable and it is transparent what's going on (i.e. do NOT store internal coroutine state, assume coroutines are transient). Maybe I don't understand the idea.


achille-roussel 814 days ago

Distributed coroutines are a primitive to express transactional workflows that may last longer than the initial request/response that triggered them (think any form of async operation). While the distribution allows effective use of compute resources, capturing the state of coroutines and their progress is the key addition that enables the execution of workflows and guarantees completion.

A load balancer can help distribute new jobs across a fleet, but even the shortest of jobs can become "long running" when it hits timeouts, rate limits, and other transient errors. You quickly need a scheduler to effectively orchestrate the retries without DDoS-ing your systems, and need to keep track of the state to carry jobs to completion.
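As a rough illustration of the retry half of that argument, here is a minimal sketch of capped exponential backoff with jitter, which is one common way to retry without hammering a struggling service. It says nothing about how Dispatch schedules retries internally; `TransientError`, `backoff_delays`, and `run_with_retries` are hypothetical names, and the default `sleep` is a no-op so the example runs instantly.

```python
import random


class TransientError(Exception):
    """Stands in for timeouts, rate limits, and similar recoverable failures."""


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one delay per attempt: exponential growth, capped, with full jitter."""
    for attempt in range(attempts):
        yield rng() * min(cap, base * 2 ** attempt)


def run_with_retries(job, attempts=5, sleep=lambda seconds: None):
    """Call job() until it succeeds or attempts run out; return (result, tries).

    Pass sleep=time.sleep for real waiting between attempts.
    """
    tries = 0
    for delay in backoff_delays(attempts):
        tries += 1
        try:
            return job(), tries
        except TransientError:
            sleep(delay)
    raise TransientError(f"gave up after {tries} tries")


# A job that is rate limited twice before it goes through.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("rate limited")
    return "done"


result, tries = run_with_retries(flaky)
```

Note that this loop keeps the attempt counter in the memory of one process. If that process dies mid-retry, the progress is gone, which is the gap that persisted job state is meant to close.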

Combine a scheduler (like Dispatch) with a primitive like distributed coroutines, and you've got a powerful foundation to create distributed applications of all kinds without seeing complexity skyrocket.


kgeist 814 days ago

OK, from what I understand, it's similar to what we do as well, except Dispatch adds magic while we do it all manually. We have an event-based system: instead of await points, we fire events which are stored inside an AMQP broker. The broker has N consumers on different nodes which take new jobs as they arrive. Retries/circuit breakers etc. are added manually (via a Go library), and if a job/event handler fails, it's re-added to the AMQP queue (someone else will process it later). Inside event handlers/job processors we also enjoy Go's built-in local scheduler (so I/O calls do not block entire cores).

I can see the benefit that with Dispatch, logic is simpler to read and write as just ordinary functions, while in our approach we have to scatter it around various event handlers/job processors. However, I still like that in our approach, event handlers/job processors are entirely stateless (the only state is jobs/event payloads). I've found that to be good for scalability and reliability, and easier to reason about, compared to passing around internal coroutine state.
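The event-based pattern described above can be sketched roughly as follows. This is an illustration only: an in-memory `deque` stands in for the AMQP broker, Python is used instead of Go, and the event types, the `process` loop, and the `outage` flag are all invented for the example.

```python
from collections import deque


def process(queue, handlers, max_attempts=3):
    """Drain the queue; a failed event goes back on the queue until max_attempts."""
    completed, dead = [], []
    while queue:
        event = queue.popleft()
        try:
            # Handlers get only the plain payload and return follow-up events.
            follow_ups = handlers[event["type"]](event["payload"])
        except Exception:
            attempts = event.get("attempts", 0) + 1
            retry = {**event, "attempts": attempts}
            (queue if attempts < max_attempts else dead).append(retry)
            continue
        completed.append(event["type"])
        queue.extend(follow_ups)
    return completed, dead


outage = [True]  # simulates a downstream service that fails exactly once


def on_order_placed(payload):
    # Instead of awaiting the charge, fire an event for another handler.
    return [{"type": "charge_card", "payload": {"order": payload["order"], "cents": 1250}}]


def on_charge_card(payload):
    if outage:
        outage.pop()
        raise ConnectionError("payment provider unavailable")
    return []


handlers = {"order_placed": on_order_placed, "charge_card": on_charge_card}
queue = deque([{"type": "order_placed", "payload": {"order": 42}}])
completed, dead = process(queue, handlers)
```

Everything a handler needs travels in the event payload, so any consumer on any node could pick up the retried `charge_card` event. The trade-off mentioned above is visible too: the "place order, then charge" logic is split across two handlers instead of reading as one function.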


achille-roussel 814 days ago

Yes, that sounds very similar indeed. We've launched Dispatch because this is a universal problem that engineering teams end up having to reinvent over and over.

Dispatch can also handle the "one-off" jobs you describe, where you don't need to track the coroutine state. In a way, it's a subset/special case of the distributed coroutine (just like functions are a special case of coroutines with no yield point).
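The "special case" relationship can be shown with a toy driver, sketched here with plain Python generators rather than Dispatch's actual API. `drive`, `one_off`, and `workflow` are hypothetical names. The driver treats a plain function as a task that finishes in one step, and a generator as a task with yield points where a scheduler could checkpoint or move it.

```python
import inspect


def drive(task, *args):
    """Run task to completion; return (result, number of yield points crossed)."""
    outcome = task(*args)
    if not inspect.isgenerator(outcome):
        return outcome, 0  # plain function: finished in one step, no state to track
    suspensions = 0
    try:
        while True:
            next(outcome)  # a scheduler could checkpoint or move the task here
            suspensions += 1
    except StopIteration as stop:
        return stop.value, suspensions


def one_off(x):
    return x * 2


def workflow(x):
    yield "reserve"
    yield "charge"
    return x * 2


one_off_result = drive(one_off, 21)
workflow_result = drive(workflow, 21)
```

Both tasks produce the same value; the only difference the driver sees is how many times the task paused on the way there.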


achille-roussel 814 days ago

Your comment on this being solved by inventing new programming languages like Erlang is right on point.

Our take is that distributed coroutines can be brought to general-purpose programming languages like Python so that engineering teams can adopt these features incrementally in their existing applications instead of adding a whole new language or framework.

In my opinion, the value is in being able to reuse the software tools and processes you're familiar with; major shifts are rarely the right call.
