|
|
|
|
|
by diggs
1464 days ago
|
|
This is a misleading and dangerous service. You provide a distributed lease, not a lock. A distributed lease by itself doesn’t provide mutual exclusion. Distributed leases are typically accompanied with a fencing token (which your service cannot provide out of band) or an optimistic lock on the underlying "exclusive" resources (which could be implemented by the consumer of your service). I think of distributed leases as an optimization e.g. they provide soft exclusion in the happy path, which may reduce thrashing on the underlying real mutual exclusion mechanism (like an optimistic lock) under general use. Your docs and terminology guide users into building incorrect systems e.g “ensure only a single instance of a process runs at any given time” is simply not true. edit: I had a quick look at your Python library, and while Python isn't my forte, it looks like it can be trivially induced to break the mutual exclusion mirage: the "requests" lib's default request timeout is None e.g infinite, so what happens when `client.try_heartbeat` goes out to lunch indefinitely? Looks like `locks._run_heartbeat_loop` will hang, the lease will never be renewed, and within 60 seconds a competing lease holder will claim it. Boom. So you fix the timeout issue and problem solved right? Still no dice... what happens when you hit a pathological hang in your non-real-time runtime or OS? Boom. |
|
Numerous AWS (really early) internal eng talks I saw, hammered home the point that distributed locking / leasing was one of the hardest problems to engineer for, if not the hardest, often quoting Chubby and Paxos Made Live papers.
Today, if I were to need a form of distributed co-ordination, I'd reach for Durable Objects to see if it fits my needs.