| I also wrote about distributed locking a while ago, particularly about implementing one on Google Cloud Storage. https://www.joyfulbikeshedding.com/blog/2021-05-19-robust-di... This is useful because cloud storage is very cheap and serverless. It certainly beats running a Redis or PostgreSQL instance. In my research and implementation I took care of the problems described in this article (as well as the problems I encountered in other distributed lock implementations). Clients either freezing or outright crashing or disappearing is definitely a major problem. So timeouts are a must. But you can't just have a timeout because you don't know whether a client is arbitrarily delayed or whether it has disappeared. Furthermore, some workloads have an unknown completion time so you can't pick any sensible timeout. I address this by encouraging clients to behave as follows: 1. Make the timeout relatively short, but regularly refresh your lease to let others know that you're still alive. 2. Your lease may at any time be taken over by another client that thinks you've disappeared. So you must regularly check whether your lease on the lock is still valid. This checking must be done way, way before your timeout expires, in order to account for the possibility that you can be arbitarily delayed at any moment (GC pauses, scheduling pauses, network delays, etc). I recommend checking in an interval at most 1/8 of the timeout. Smaller intervals are better at the expense of more overhead. 3. Attempt to split up your work into a long-running, discardable preparation phase and a short-running commit phase. Verify the validity of your lock lease right before committing (in addition to periodic verification). This is of course still not 100% foolproof. There is still risk that two clients run at the same time, but this risk can be arbitrarily reduced by reducing the verification interval and by optimizing for doctrine 3. If your workload cannot tolerate any risk of two clients running at the same time, no matter how small the risk, then pick an infinitely timeout. But I think for most workloads, the benefits of not having to administer the lock (not having to manually break its lease because a client has crashed) outweight the risk. My algorithm has a Ruby implementation, as well as a third-party Go implementation. Ruby: https://github.com/FooBarWidget/distributed-lock-google-clou... Go: https://github.com/FooBarWidget/distributed-lock-google-clou... |