It seems like a lock should be able to autorelease in a distributed environment if the acquirer is no longer available. Would this not be considered "safe?"
The proposed algorithm provides a time-bound safety guarantee: once the lock is acquired it has a specified validity time. After that time has elapsed, another client can acquire it.
In practical terms this forces the protected code path to be "real time", that is, guaranteed to terminate (or to abort) within the specified time.
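To make the "terminate or abort" requirement concrete, here is a minimal sketch in Python. The names `run_within_validity`, `LockExpired`, and `safety_margin_s` are made up for illustration and are not part of the proposed algorithm. It assumes the protected work can be split into small steps and checks the remaining validity time before each one.

```python
import time


class LockExpired(Exception):
    """Raised when too little validity time is left to continue safely."""


def run_within_validity(steps, validity_s, safety_margin_s=0.0, clock=time.monotonic):
    """Run each step only while the lock is still considered valid.

    steps: iterable of zero-argument callables (hypothetical units of work).
    validity_s: validity time granted when the lock was acquired.
    safety_margin_s: stop this long before the nominal expiry.
    """
    deadline = clock() + validity_s
    results = []
    for step in steps:
        # Abort instead of starting work we may not finish in time.
        if clock() >= deadline - safety_margin_s:
            raise LockExpired("validity time exhausted; aborting protected section")
        results.append(step())
    return results
```

Note that the check only happens between steps. A single step that stalls (a long pause, a slow network call) can still overrun the validity time, so this narrows the window rather than closing it.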
>In practical terms this forces the protected code path to be "real time", that is, guaranteed to terminate (or to abort) within the specified time.
You could keep re-acquiring the lock when it gets close to expiring, and stop your process if the lock becomes un-acquirable.
Yes, this is a good strategy. We are even guaranteed to be able to re-acquire the lock if we send the re-acquire request in time and there are no new partitions. To re-acquire, the client can send a script that checks whether the stored value still matches its own and, if so, extends the expiry of the keys. Basically, the lock holder re-acquires by extending the duration of the previous lock before it expires.
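A rough sketch of that check-and-extend idea, in Python. It runs against a hypothetical in-memory store (`InMemoryLockStore`) rather than a real server, so it only models a single node and ignores partitions and clock drift. The `EXTEND_SCRIPT` string shows roughly what the server-side script could look like; it is included for reference and is not executed here.

```python
import time

# Roughly what a server-side compare-and-extend script could look like.
# Reference only: this sketch never sends it to a server.
EXTEND_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("pexpire", KEYS[1], ARGV[2])
else
    return 0
end
"""


class InMemoryLockStore:
    """Hypothetical single-node stand-in for a store with expiring keys."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._locks = {}  # key -> (owner_token, expires_at)

    def _live_entry(self, key):
        entry = self._locks.get(key)
        if entry is not None and entry[1] <= self._clock():
            del self._locks[key]  # expired: behave as if the key is gone
            return None
        return entry

    def acquire(self, key, token, ttl_s):
        """Set the key only if nobody currently holds it."""
        if self._live_entry(key) is not None:
            return False
        self._locks[key] = (token, self._clock() + ttl_s)
        return True

    def extend(self, key, token, ttl_s):
        """Extend the expiry only if the stored token matches the caller's."""
        entry = self._live_entry(key)
        if entry is None or entry[0] != token:
            return False
        self._locks[key] = (token, self._clock() + ttl_s)
        return True


def work_while_renewing(store, key, token, ttl_s, steps):
    """Renew before each step; stop as soon as a renewal fails."""
    done = 0
    for step in steps:
        if not store.extend(key, token, ttl_s):
            break  # lock lost (expired or taken over): stop touching the resource
        step()
        done += 1
    return done
```

The token comparison is what keeps a client whose lock already expired from extending a lock that now belongs to someone else.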
It seems like a timeout is less reliable/safe than some broadcast/ping mechanism that can check availability perpetually and if a node has disappeared the validity of the lock changes.
Trying to remember which distributed system model it is that sort of does this. Ring? Mesh?
For such a model to work I believe you need a distributed replicated state machine, and the clients need to be an active part of the distributed system (not just participants that issue requests), able to reply to pings. Yes, there is a safety advantage in the model you describe: if an operation takes longer than expected, the other clients can simply wait longer. In practice, though:
1) What do you do if the client replies to pings but takes an apparently never-ending time to perform the operation on the shared resource?
2) What if the client is operating correctly on the shared resource, and the only component that is failing is the system you use to check its availability?
1) I think a mix of the two approaches would work here: the lock still has an actual timeout, but the timeout is not the only thing that keeps the lock held.
2) I suppose that's always possible, but then what would happen is the lock would be released. Not ideal behavior but also not one that presents a data reliability issue.
Safety? It doesn't help. Reliability? It does, because it would prevent another node from acquiring a held lock while a server is available, and it would release the lock if a server goes down.