|
|
|
|
|
by spenczar5
833 days ago
|
|
What happens if a worker goes silent for longer than the heartbeat duration, then a new worker is spawned, then the original worker “comes back to life”? For example, because there was a network partition, or because the first worker’s host machine was sleeping, or even just that the first worker process was CPU starved? |
|
We currently send cancellation signals for individual tasks to workers, but our cancellation signals aren't replayed if they fail on the network. This is an important edge case for us to figure out.
There's not much we can do if the worker ignores that signal. We should probably add some alerting if we see multiple responses on the same task, because that means the worker is ignoring the cancellation signal. This would also be a problem if workloads start blocking the whole thread.