by joebob42
1509 days ago
Am I missing why a distributed lock is an impossibility? The problem stated is that a partitioned node can't know it has lost the lock, but this is only an issue if there is a way to lose the lock short of returning it. Which I guess is to say: what difference is there between a "lock" and a lease with an infinite timeout that is only released when manually returned? Certainly the system deadlocks under partition, but I'm not sure why that makes this "impossible".
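To make the question concrete, here is a minimal single-process sketch of a lease table entry. The `Lease` class and its method names are hypothetical and nothing here talks to a real network or lock service; it only shows that passing an infinite TTL turns the lease into something that behaves like a plain lock, which can only be given back by its holder.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """Hypothetical lease record: one holder plus an expiry time."""
    holder: Optional[str] = None
    expires_at: float = 0.0

    def acquire(self, node: str, now: float, ttl: float) -> bool:
        # The lease is taken only while someone holds it AND it has not expired.
        if self.holder is not None and now < self.expires_at:
            return False
        self.holder = node
        self.expires_at = now + ttl  # ttl=math.inf means "never expires"
        return True

    def release(self, node: str) -> None:
        # Only the current holder can hand the lease back.
        if self.holder == node:
            self.holder = None


finite = Lease()
finite.acquire("a", now=0.0, ttl=10.0)
# After the TTL passes, another node can take over even if "a" never released.
took_over = finite.acquire("b", now=10.0, ttl=10.0)

forever = Lease()
forever.acquire("a", now=0.0, ttl=math.inf)
# With an infinite TTL, "b" is blocked until "a" explicitly releases.
blocked = not forever.acquire("b", now=1e9, ttl=math.inf)
```

If "a" is partitioned away in the second case, nothing ever frees the lease, which is the deadlock described above.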
|
I would argue that "infinite timeout" is another negative shibboleth.
Every operation in a distributed system has some duration after which you can be 99.9% confident (or 99.9999%, or whatever threshold you want to pick) that it was lost to the void and will never return a result.
In a robust distributed system, you want to pick a reasonable timeout value and then take appropriate action when it fires. Typically this means retrying the operation, bubbling a failure message up to a higher level, or some combination of the two (retry a few times, then fail if all the retries fail).
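A rough sketch of that "retry a few times, then bubble up" pattern, assuming the operation is a callable that accepts a timeout in seconds and raises Python's built-in `TimeoutError` when no reply arrives in time. The names `call_with_retries` and `OperationFailed`, and the default values, are made up for illustration.

```python
import time


class OperationFailed(Exception):
    """Raised to the caller once every attempt has timed out."""


def call_with_retries(operation, timeout_s=2.0, max_attempts=3, backoff_s=0.0):
    """Run operation(timeout_s), retrying on TimeoutError.

    `operation` is assumed to enforce the timeout itself and to raise
    TimeoutError when it gives up waiting for a reply.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(timeout_s)
        except TimeoutError as exc:
            last_error = exc
            if attempt < max_attempts:
                # Linear backoff between attempts; 0.0 disables the pause.
                time.sleep(backoff_s * attempt)
    # All retries failed: surface the failure to the higher level.
    raise OperationFailed(f"gave up after {max_attempts} attempts") from last_error
```

The important part is that the loop is bounded: after `max_attempts` the caller gets an explicit failure instead of a call that hangs forever.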
An infinite timeout represents a deliberate design choice of "I don't want to handle the case of this message or API call being lost in transit and never returning either success or failure".
In my experience, infinite timeouts are often the cause of "hmm, this thing is up and running but seems 'stuck' and not making any progress, let me try manually restarting this service... OK, that seems to have recovered it" bugs and production alerts.