| HN Mirror



	by spenczar5 833 days ago
	What happens if a worker goes silent for longer than the heartbeat duration, a new worker is spawned, and then the original worker “comes back to life”? This could happen because of a network partition, because the first worker’s host machine was sleeping, or simply because the first worker process was CPU-starved.


abelanger 833 days ago

The heartbeat duration (5s) is not the same as the inactive duration (60s). We only reassign once a worker has been down for 60 seconds, which provides some buffer and handles unstable networks. Once someone asks, we'll expose these options and make them configurable.
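To make the two thresholds concrete, here is a minimal sketch of how a scheduler might separate them. Only the 5s and 60s values come from the comment above; `WorkerRegistry` and its methods are hypothetical names for illustration, not the project's actual code.

```python
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL_S = 5.0  # how often a worker is expected to check in
INACTIVE_AFTER_S = 60.0     # silence required before tasks are reassigned


@dataclass
class WorkerRegistry:
    """Hypothetical registry that tracks the last heartbeat per worker."""

    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, worker_id: str, now: float) -> None:
        # Record the time of the most recent heartbeat from this worker.
        self.last_seen[worker_id] = now

    def inactive_workers(self, now: float) -> list:
        # A worker counts as inactive only after INACTIVE_AFTER_S of silence,
        # so missing a few 5s heartbeats does not trigger reassignment.
        return sorted(
            worker_id
            for worker_id, seen in self.last_seen.items()
            if now - seen >= INACTIVE_AFTER_S
        )
```

With these values a worker can miss up to eleven consecutive heartbeats before its tasks would be handed to someone else.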

We currently send cancellation signals for individual tasks to workers, but our cancellation signals aren't replayed if they fail on the network. This is an important edge case for us to figure out.

There's not much we can do if the worker ignores that signal. We should probably add some alerting if we see multiple responses on the same task, because that means the worker is ignoring the cancellation signal. This would also be a problem if workloads start blocking the whole thread.
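A rough sketch of what that alerting check could look like, assuming each task response carries a task ID and a worker ID. `DuplicateResponseMonitor` is a hypothetical name, and treating "multiple responses" as "responses from more than one distinct worker" is an assumption of this sketch.

```python
from collections import defaultdict


class DuplicateResponseMonitor:
    """Hypothetical monitor that flags tasks answered by more than one worker."""

    def __init__(self):
        # Maps each task ID to the set of workers that have responded to it.
        self._responders = defaultdict(set)

    def record(self, task_id: str, worker_id: str) -> bool:
        """Record a response and return True if an alert should fire."""
        responders = self._responders[task_id]
        responders.add(worker_id)
        # More than one distinct worker means a reassigned task was also
        # completed by the original worker, i.e. cancellation was missed
        # or ignored.
        return len(responders) > 1
```

A retry from the same worker does not trigger the alert here; only a second, different worker responding to the same task does.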

spenczar5 833 days ago

Right, I meant inactive duration, of course.

Cancellation signals are tricky. You of course cannot be sure that the remote end receives them. This turns into the Two Generals' Problem.

Yes, you need monitoring for this case. I work on scientific workloads which can completely consume CPU resources. This failure scenario is quite real.

Not all tasks are idempotent, but it sounds like a prudent user should try to design things that way, since your system has “at least once” execution of tasks, as opposed to “at most once.” Despite any marketing claims, “exactly once” is not generally possible.

Good docs on this point are important, as is configurability for cases when “at most once” is preferable.
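For what it's worth, here is a minimal sketch of the kind of idempotency guard a user could put around a task under "at least once" delivery. `IdempotentHandler` is a hypothetical name, and the in-memory dict is an assumption for illustration; a real setup would need a durable, shared store with an atomic check-and-set to cover two workers running the same task concurrently.

```python
class IdempotentHandler:
    """Hypothetical wrapper that runs a handler at most once per task ID."""

    def __init__(self, handler):
        self._handler = handler
        # In-memory only for illustration; not safe across processes.
        self._results = {}

    def handle(self, task_id, payload):
        # A redelivered task returns the stored result instead of
        # repeating the side effect.
        if task_id in self._results:
            return self._results[task_id]
        result = self._handler(payload)
        self._results[task_id] = result
        return result
```

Usage might look like this, where a redelivered task does not repeat the side effect:

```python
calls = []

def charge(amount):
    calls.append(amount)  # stands in for a non-idempotent side effect
    return amount * 2

handler = IdempotentHandler(charge)
handler.handle("task-1", 21)
handler.handle("task-1", 21)  # duplicate delivery, handler is not re-run
```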