| > a lease with an infinite timeout unless manually returned I would argue that "infinite timeout" is another negative shibboleth. every operation in a distributed system has some duration after which you can be 99.9% confident (or 99.9999%, or whatever threshold you want to pick) that it was lost to the void and will never return a result. in a robust distributed system, you want to pick a reasonable timeout value, and then take appropriate action in response to the timeout. typically this is retrying the operation, bubbling up a failure message to a higher level, or some combination of the two (retry a few times, fail if all the retries fail). an infinite timeout represents a deliberate design choice of "I don't want to handle the case of this message or API call being lost in-transit and never returning either success or failure". in my experience, infinite timeouts are often the cause of "hmm, this thing is up and running but seems 'stuck' and not making any progress, let me try manually restarting this service...OK, that seems to have recovered it" bugs and production alerts. |
Zombie processes (dependent on some lock that will never clear) shouldn't be possible. At the very least abort (kill -9) should always be possible.
Failure should always be an option; it should be the default assumption. All other order must be wrested from that chaos.