by diggs 1464 days ago
This is a misleading and dangerous service.

You provide a distributed lease, not a lock. A distributed lease by itself doesn’t provide mutual exclusion. Distributed leases are typically accompanied by a fencing token (which your service cannot provide out of band) or an optimistic lock on the underlying "exclusive" resources (which could be implemented by the consumer of your service). I think of distributed leases as an optimization: they provide soft exclusion in the happy path, which may reduce thrashing on the underlying real mutual exclusion mechanism (like an optimistic lock) under general use.
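A minimal sketch of the optimistic-lock idea mentioned above, assuming a hypothetical in-memory `VersionedStore` (not part of the service or its client). The resource itself enforces exclusion by checking a version number, so a writer holding a stale view is rejected regardless of what it believes about its lease.

```python
import threading


class VersionedStore:
    """Hypothetical in-memory resource guarded by an optimistic lock (version check)."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        self._mutex = threading.Lock()  # makes the compare-and-set atomic in-process

    def read(self):
        with self._mutex:
            return self._value, self._version

    def write_if_unchanged(self, new_value, expected_version):
        """Apply the write only if nobody else has written since we read."""
        with self._mutex:
            if self._version != expected_version:
                return False  # stale view; caller must re-read and retry
            self._value = new_value
            self._version += 1
            return True


store = VersionedStore("initial")
_, seen_by_a = store.read()
_, seen_by_b = store.read()  # B reads while A still believes it holds the lease

a_ok = store.write_if_unchanged("from A", seen_by_a)  # accepted
b_ok = store.write_if_unchanged("from B", seen_by_b)  # rejected: version moved on
```

In this arrangement the lease only reduces how often writers collide; correctness comes from the version check on the resource.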

Your docs and terminology guide users into building incorrect systems. For example, “ensure only a single instance of a process runs at any given time” is simply not true.

edit: I had a quick look at your Python library, and while Python isn't my forte, it looks like it can be trivially induced to break the mutual exclusion mirage: the "requests" lib's default request timeout is None, i.e. infinite, so what happens when `client.try_heartbeat` goes out to lunch indefinitely? Looks like `locks._run_heartbeat_loop` will hang, the lease will never be renewed, and within 60 seconds a competing lease holder will claim it. Boom. So you fix the timeout issue and problem solved, right? Still no dice... what happens when you hit a pathological hang in your non-real-time runtime or OS? Boom.
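One way to narrow the first failure mode, sketched under assumptions: the client keeps its own conservative deadline on a monotonic clock and stops trusting the lease once that deadline passes, whether or not the heartbeat ever returns. `LocalLease` and `send_heartbeat` are hypothetical names, not the library's API; with `requests`, the real call would also pass a finite `timeout=` argument.

```python
import time


class LocalLease:
    """Client-side view of when the lease can no longer be trusted (hypothetical)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        # Assumes the object is created just before the acquire request is sent.
        self._ttl = ttl_seconds
        self._clock = clock
        self._valid_until = clock() + ttl_seconds

    def renew(self, send_heartbeat):
        # Timestamp BEFORE the call so a slow response makes the local
        # deadline pessimistic rather than optimistic.
        started = self._clock()
        try:
            ok = send_heartbeat()  # hypothetical callable; should use a finite timeout
        except Exception:
            ok = False
        if ok:
            self._valid_until = started + self._ttl
        return ok

    def is_valid(self):
        return self._clock() < self._valid_until


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
lease = LocalLease(ttl_seconds=60, clock=lambda: now[0])

now[0] = 30.0
renewed = lease.renew(lambda: True)    # local deadline moves to t=90

now[0] = 80.0
failed = lease.renew(lambda: False)    # renewal refused; deadline stays at t=90
still_valid = lease.is_valid()         # t=80 is before the deadline

now[0] = 95.0
expired = not lease.is_valid()         # deadline passed without a renewal
```

This does nothing for the second failure mode: a pause between `is_valid()` and the actual write can still let a stale holder through.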

4 comments

Can't resist: Overheard in Stanford locker room after Show HN. "He got me" L said of D's dunk over him. That fking D boomed me. L added "He's so good", repeating it four times. L then said he wanted to add D to the list of players he works out with this summer.

Numerous (really early) internal AWS eng talks I saw hammered home the point that distributed locking / leasing was one of the hardest problems to engineer for, if not the hardest, often quoting the Chubby and Paxos Made Live papers.

Today, if I were to need a form of distributed co-ordination, I'd reach for Durable Objects to see if it fits my needs.

Those are valid points.

The Python client implementation can be improved, for sure. In particular, the pathological case that is difficult to deal with is the one where the heartbeat thread pauses at the worst possible time (the "pathological hang" you mention) and so the main thread doesn't notice its lease has expired.

Lockable is designed as an easy-to-use drop-in locking mechanism which requires minimal changes client-side. The downside is that this puts the onus on the client implementation to fully cover all pathological cases.

I agree that the documentation should be clearer about these limitations and requirements on the client.

Thanks for the feedback, it's really useful!

If I understood correctly, the problem you're describing is essentially the following. The locks have a timeout to avoid zombie locks etc. However, this means system A can obtain a lock (or a lease, if you prefer) and begin a long-running process on the given resource. During the long-running process, the lock times out and system B can acquire a lock, thinking it's the only one using the resource, leading to a conflict. Did I understand you correctly?

We once had a bug from this incorrect use of "distributed locks". A server we accessed under a lock suddenly started lagging past the timeout of the lock, another server using the lock assumed the lock was released (i.e. timed out) and acquired it, while the original server assumed it still owned the lock. Data corruption occurred.
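The scenario above can be reproduced with a toy model. This is only an illustration with a hypothetical `LeaseServer` and a hand-advanced clock, not the behaviour of any particular service.

```python
class LeaseServer:
    """Toy lease server with an explicit clock (hypothetical, for illustration)."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.now = 0
        self.holder = None
        self.expires_at = 0

    def try_acquire(self, client):
        # Grant the lease if it is free or the previous grant has timed out.
        if self.holder is None or self.now >= self.expires_at:
            self.holder = client
            self.expires_at = self.now + self.ttl
            return True
        return False


server = LeaseServer(ttl=60)
writes = []

a_has_lease = server.try_acquire("A")  # A gets the lease at t=0
b_first_try = server.try_acquire("B")  # refused while A's lease is live

server.now = 61                        # A stalls past the timeout
b_has_lease = server.try_acquire("B")  # granted: A's lease has expired

# Nothing told A that it lost the lease, so both clients write.
if a_has_lease:
    writes.append("A")
if b_has_lease:
    writes.append("B")
```

Both writes go through because the only record of A's ownership that A consults is its own stale local state.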

This implementation has "heartbeats" so I wonder whether it solves the problem.

Heartbeats do not solve it. You need fencing tokens to reject writes if the lock has expired.

See this amazing article by Martin Kleppmann, author of Designing Data-Intensive Applications.

"How to do distributed locking"

https://martin.kleppmann.com/2016/02/08/how-to-do-distribute...

It's really up to the client implementation.

In order to deal with long-running processes, the client Python implementation uses a separate thread for sending periodic heartbeats to the lockable server, which serves to do 2 things:

  1. renew the lease so it doesn't expire, which would release the lock
  2. notify the main worker thread in the event the lock has been lost

The GP's point was that the heartbeat thread can hang in pathological cases, which means the main worker thread would not be notified that it has lost the lock.
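A rough sketch of the two-role heartbeat thread described above, using only the standard library. The function and variable names are hypothetical and this is not the actual client code; `renew` stands in for whatever call renews the lease.

```python
import threading


def run_heartbeat_loop(renew, interval, stop, lost):
    """Renew the lease every `interval` seconds; flag `lost` if renewal fails."""
    # Event.wait returns False on timeout, so the loop runs until `stop` is set.
    while not stop.wait(interval):
        if not renew():
            lost.set()  # role 2: tell the worker the lock is gone
            return


stop = threading.Event()
lost = threading.Event()
responses = iter([True, True, False])  # simulated server replies

thread = threading.Thread(
    target=run_heartbeat_loop,
    args=(lambda: next(responses), 0.01, stop, lost),
    daemon=True,
)
thread.start()

# The worker would poll lost.is_set() between units of work.
lost_signalled = lost.wait(timeout=2)
stop.set()
thread.join(timeout=2)
```

If the heartbeat thread itself stalls, `lost` is never set and the worker carries on, which is exactly the gap the GP describes.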

This can be addressed in a few ways - one way being by adding fencing tokens[0]. However, that requires modifying the underlying resource you are accessing.

[0]: https://ebrary.net/64834/computer_science/fencing_tokens
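A minimal sketch of how fencing tokens close the gap, assuming the lease server can issue a strictly increasing number with each grant and the resource can remember the highest one it has seen. Both classes are hypothetical stand-ins.

```python
class TokenIssuingLeaseServer:
    """Hypothetical lease server that returns an increasing token with each grant."""

    def __init__(self):
        self._last = 0

    def grant(self):
        self._last += 1
        return self._last


class FencedStorage:
    """Hypothetical resource that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = None

    def write(self, token, value):
        if token < self.highest_token:
            return False  # a holder with a newer lease has already written
        self.highest_token = token
        self.data = value
        return True


server = TokenIssuingLeaseServer()
storage = FencedStorage()

token_a = server.grant()  # A acquires the lease
token_b = server.grant()  # A's lease expires; B acquires it

b_ok = storage.write(token_b, "from B")  # accepted
a_ok = storage.write(token_a, "from A")  # rejected: A's token is older
```

The check lives in the resource, which is why this approach requires modifying whatever is being protected.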

What's the advantage of heartbeats over a simpler implementation via SETNX in Redis if you still need fencing tokens?

> This is a misleading and dangerous service.

How is it dangerous? What danger would the user be in?

Building software that relies on an incorrect assumption sounds dangerous to me.

There's plenty of critical software in the world, some of which has real world consequences if there are bugs.

The user could create software that thinks a resource is locked under their exclusive control, but is not. This can lead to data corruption, which could even lead to errors propagating throughout a system.