Hacker News
by drabinowitz 2415 days ago
You can include a locked_at field and have your update query claim rows that are either not_started, or started with a locked_at older than the job timeout.
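Here is a minimal sketch of that claim query, assuming SQLite via Python's stdlib sqlite3. The table name jobs, the status values, storing locked_at as epoch seconds, and the function claim_job are all hypothetical. On Postgres you would more likely use a single UPDATE ... RETURNING, or SELECT ... FOR UPDATE SKIP LOCKED.

```python
import sqlite3
import time
from typing import Optional

# Hypothetical schema: only locked_at and the not_started/started states come
# from the comment; everything else is an assumption for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id        INTEGER PRIMARY KEY,
    status    TEXT NOT NULL DEFAULT 'not_started',
    locked_at REAL
)
"""

def claim_job(conn: sqlite3.Connection, job_timeout: float,
              now: Optional[float] = None) -> Optional[int]:
    """Claim one job that is not started, or started with a lock older than
    job_timeout (its worker is presumed dead). Returns the job id or None."""
    now = time.time() if now is None else now
    stale_before = now - job_timeout
    # Pick a candidate row first.
    row = conn.execute(
        """SELECT id FROM jobs
           WHERE status = 'not_started'
              OR (status = 'started' AND locked_at < ?)
           ORDER BY id LIMIT 1""",
        (stale_before,),
    ).fetchone()
    if row is None:
        return None
    # Claim it with a conditional UPDATE that re-checks the same predicate;
    # if another worker claimed it in between, rowcount is 0 and we back off.
    cur = conn.execute(
        """UPDATE jobs SET status = 'started', locked_at = ?
           WHERE id = ?
             AND (status = 'not_started'
                  OR (status = 'started' AND locked_at < ?))""",
        (now, row[0], stale_before),
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None
```

A caller that gets None can simply retry or sleep. The re-checked WHERE clause is what keeps two workers from both claiming the same row.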

A global job timeout might be unreasonable when workload varies a lot, e.g. some jobs take 0.5 seconds and others 30 seconds. You might set a global timeout of, say, 60s, but it sucks to wait 59.5s to reap that short job whose worker crashed. A better system is to have workers update a timestamp on an interval and reap any job that hasn't been updated in N seconds.
It's a trade-off between updates per second and detection latency.
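A rough sketch of the heartbeat-and-reap idea, again assuming SQLite via stdlib sqlite3. The column name heartbeat_at and the functions heartbeat and reap_stale are hypothetical. The worker would call heartbeat every few seconds while it runs a job, and a separate reaper process would call reap_stale periodically.

```python
import sqlite3
import time
from typing import Optional

# Hypothetical schema: heartbeat_at is the timestamp workers refresh.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id           INTEGER PRIMARY KEY,
    status       TEXT NOT NULL DEFAULT 'not_started',
    heartbeat_at REAL
)
"""

def heartbeat(conn: sqlite3.Connection, job_id: int,
              now: Optional[float] = None) -> None:
    """Worker side: refresh the timestamp for the job it is running."""
    now = time.time() if now is None else now
    conn.execute(
        "UPDATE jobs SET heartbeat_at = ? WHERE id = ? AND status = 'started'",
        (now, job_id),
    )
    conn.commit()

def reap_stale(conn: sqlite3.Connection, max_silence: float,
               now: Optional[float] = None) -> int:
    """Reaper side: put back any started job whose worker has been silent
    for more than max_silence seconds. Returns how many jobs were reset."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE jobs SET status = 'not_started', heartbeat_at = NULL "
        "WHERE status = 'started' AND heartbeat_at < ?",
        (now - max_silence,),
    )
    conn.commit()
    return cur.rowcount
```

With this shape, a crashed 0.5s job is reaped after roughly max_silence seconds instead of a global 60s timeout, at the cost of one UPDATE per worker per interval.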

Maybe simply using a timeout per job type is a better way (though of course that trades away some simplicity).
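One way a per-job-type timeout might look, sketched under the assumption of a hypothetical job_types table holding a timeout in seconds for each type. The table, column names, and find_reclaimable are illustrative only. The staleness check simply joins in each job's own timeout instead of using one global value.

```python
import sqlite3
from typing import List

# Hypothetical schema: every job has a job_type with its own timeout_s.
SCHEMA = """
CREATE TABLE job_types (name TEXT PRIMARY KEY, timeout_s REAL NOT NULL);
CREATE TABLE jobs (
    id        INTEGER PRIMARY KEY,
    job_type  TEXT NOT NULL REFERENCES job_types(name),
    status    TEXT NOT NULL DEFAULT 'not_started',
    locked_at REAL
);
"""

def find_reclaimable(conn: sqlite3.Connection, now: float) -> List[int]:
    """Ids of jobs that are unstarted, or whose lock is older than the
    timeout configured for that job's own type."""
    rows = conn.execute(
        """SELECT j.id FROM jobs AS j
           JOIN job_types AS t ON t.name = j.job_type
           WHERE j.status = 'not_started'
              OR (j.status = 'started' AND j.locked_at < ? - t.timeout_s)
           ORDER BY j.id""",
        (now,),
    ).fetchall()
    return [r[0] for r in rows]
```

The extra table (or a timeout column on each job) is the simplicity being traded away.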

I agree. Update frequency also becomes more of an issue as you add workers. Say you have 1000 workers, each updating every 2 seconds. That's ~500 timestamp UPDATE statements per second, which is non-trivial added load on the DB.
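The back-of-envelope math can be written out as a tiny helper showing both sides of the trade-off. It assumes, hypothetically, that the reaper declares a worker dead after missed_beats consecutive missed heartbeats. The detection figure is approximate because it ignores how often the reaper itself runs.

```python
from typing import Tuple

def heartbeat_cost(workers: int, interval_s: float,
                   missed_beats: int = 3) -> Tuple[float, float]:
    """Return (DB writes per second, approx. seconds to detect a dead worker)."""
    # Each worker issues one timestamp UPDATE per interval.
    writes_per_s = workers / interval_s
    # Reap threshold assumed to be a fixed number of missed intervals.
    detection_s = missed_beats * interval_s
    return writes_per_s, detection_s

for interval in (2, 5, 10):
    writes, detect = heartbeat_cost(1000, interval)
    print(f"{interval:>2}s interval: {writes:.0f} updates/s, ~{detect:.0f}s to detect")
```

Lengthening the interval cuts write load proportionally but stretches detection latency by the same factor.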