Hacker News
Show HN: Lockable – sync locks for distributed systems (lockable.dev)
77 points by keyless_ 1466 days ago
Hi guys, creator of lockable here - the easiest way to think of lockable is as `flock` for when you don’t have a shared file system. You can use it to control concurrent access to resources or to ensure only a single instance of a process runs at any given time.

Your processes can acquire, refresh and release locks via simple HTTP requests, so it’s language/framework agnostic. E.g. with `curl`:

    $ curl https://lockable.dev/api/acquire/my-lock-name
    {
      "response": true //or false, if the lock wasn’t available
    }


    $ curl https://lockable.dev/api/release/my-lock-name

There’s also a Python client[0], which makes using the service a more pleasant experience.
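
For readers who prefer Python to `curl`, the two requests above might be wrapped roughly as follows. This is a sketch using only the standard library, not the official client; the helper names (`lock_url`, `parse_response`, `try_acquire`, `release`) are made up for illustration, and it assumes the reply body has the `{"response": true}` shape shown above.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://lockable.dev/api"  # taken from the curl examples above


def lock_url(action: str, name: str) -> str:
    """Build the URL for an 'acquire' or 'release' call (hypothetical helper)."""
    return f"{BASE_URL}/{action}/{urllib.parse.quote(name, safe='')}"


def parse_response(body: str) -> bool:
    """Return True only if the JSON body reports success."""
    return json.loads(body).get("response") is True


def try_acquire(name: str, timeout: float = 5.0) -> bool:
    """Attempt to take the lock; the explicit timeout avoids hanging forever."""
    with urllib.request.urlopen(lock_url("acquire", name), timeout=timeout) as resp:
        return parse_response(resp.read().decode("utf-8"))


def release(name: str, timeout: float = 5.0) -> None:
    """Give the lock back; the reply body is not inspected in this sketch."""
    with urllib.request.urlopen(lock_url("release", name), timeout=timeout):
        pass
```

A caller would check the boolean from `try_acquire("my-lock-name")` before touching the shared resource and call `release` in a `finally` block.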

Feel free to play around, the free tier is fully functional. Happy to hear any feedback you might have.

[0]: https://docs.lockable.dev/en/latest/python-client/

14 comments

This is a misleading and dangerous service.

You provide a distributed lease, not a lock. A distributed lease by itself doesn’t provide mutual exclusion. Distributed leases are typically accompanied by a fencing token (which your service cannot provide out of band) or an optimistic lock on the underlying "exclusive" resources (which could be implemented by the consumer of your service). I think of distributed leases as an optimization: they provide soft exclusion in the happy path, which may reduce thrashing on the underlying real mutual exclusion mechanism (like an optimistic lock) under general use.
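
The "optimistic lock on the underlying resource" idea can be illustrated with a small in-memory sketch. `VersionedStore` is a hypothetical stand-in for whatever storage actually holds the data; the only point is that a write carries the version it was based on and is rejected if the version has moved on.

```python
import threading


class StaleWriteError(Exception):
    """Raised when a write is based on an outdated version."""


class VersionedStore:
    """Hypothetical resource that enforces compare-and-set on a version number."""

    def __init__(self, value=None):
        self._mutex = threading.Lock()
        self._value = value
        self._version = 0

    def read(self):
        with self._mutex:
            return self._value, self._version

    def write(self, new_value, expected_version):
        # The check and the update happen under one mutex, so they are atomic.
        with self._mutex:
            if expected_version != self._version:
                raise StaleWriteError(
                    f"expected version {expected_version}, found {self._version}"
                )
            self._value = new_value
            self._version += 1
            return self._version


store = VersionedStore("initial")
_, seen_by_a = store.read()
_, seen_by_b = store.read()
store.write("from B", seen_by_b)      # B writes first and bumps the version
try:
    store.write("from A", seen_by_a)  # A's view is now stale
    a_succeeded = True
except StaleWriteError:
    a_succeeded = False
```

With this in place the lease only reduces contention; correctness comes from the version check in the store itself.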

Your docs and terminology guide users into building incorrect systems, e.g. “ensure only a single instance of a process runs at any given time” is simply not true.

edit: I had a quick look at your Python library, and while Python isn't my forte, it looks like it can be trivially induced to break the mutual exclusion mirage: the "requests" lib's default request timeout is None, i.e. infinite, so what happens when `client.try_heartbeat` goes out to lunch indefinitely? Looks like `locks._run_heartbeat_loop` will hang, the lease will never be renewed, and within 60 seconds a competing lease holder will claim it. Boom. So you fix the timeout issue and problem solved, right? Still no dice... what happens when you hit a pathological hang in your non-real-time runtime or OS? Boom.
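
One common mitigation for the first problem is to give every heartbeat an explicit timeout (with `requests`, by passing `timeout=` to the call) and to track the lease deadline locally, so the worker stops as soon as it can no longer show the lease is fresh. The sketch below covers only the local bookkeeping; `LocalLease`, the safety margin and the fake clock are illustrative assumptions. As the comment says, a process pause between the check and the actual write still defeats it.

```python
import time


class LocalLease:
    """Tracks when our lease expires according to our own monotonic clock."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = clock() + ttl_seconds

    def renewed(self, request_started_at: float) -> None:
        # Count the TTL from when the heartbeat was *sent*, not when the reply
        # arrived, so a slow round trip cannot make us overestimate the lease.
        self._expires_at = request_started_at + self._ttl

    def is_fresh(self, safety_margin: float = 0.0) -> bool:
        return self._clock() + safety_margin < self._expires_at


# Demonstration with a fake clock so the example is deterministic.
now = [0.0]
lease = LocalLease(ttl_seconds=60.0, clock=lambda: now[0])
now[0] = 30.0
fresh_mid_lease = lease.is_fresh(safety_margin=5.0)
now[0] = 58.0
fresh_near_expiry = lease.is_fresh(safety_margin=5.0)
lease.renewed(request_started_at=58.0)
fresh_after_renewal = lease.is_fresh(safety_margin=5.0)
```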

Can't resist: Overheard in Stanford locker room after Show HN. "He got me" L said of D's dunk over him. That fking D boomed me. L added "He's so good", repeating it four times. L then said he wanted to add D to the list of players he works out with this summer.

Numerous (really early) internal AWS eng talks I saw hammered home the point that distributed locking / leasing was one of the hardest problems to engineer for, if not the hardest, often quoting the Chubby and Paxos Made Live papers.

Today, if I were to need a form of distributed co-ordination, I'd reach for Durable Objects to see if it fits my needs.

Those are valid points.

The Python client implementation can be improved, for sure. In particular, the pathological case that is difficult to deal with is the one where the heartbeat thread pauses at the worst possible time (the "pathological hang" you mention) and so the main thread doesn't notice its lease has expired.

Lockable is designed as an easy-to-use drop-in locking mechanism which requires minimal changes client-side. The downside is that this puts the onus on the client implementation to fully cover all pathological cases.

I agree that the documentation should be clearer about these limitations and requirements on the client.

Thanks for the feedback, it's really useful!

If I understood correctly, the problem you're describing is essentially the following. The locks have a timeout to them to avoid zombie locks etc. However this means we can have system A obtain a lock (or a lease, if you prefer) and begin a long-running process on the given resource. However, during the long-running process, the lock times out and system B can acquire a lock, thinking it's the only one using the resource, leading to a conflict. Did I understand you correctly?

We once had a bug from this incorrect use of "distributed locks". A server we accessed under a lock suddenly started lagging past the timeout of the lock, another server using the lock assumed the lock was released (i.e. timed out) and acquired it, while the original server assumed it still owned the lock. Data corruption occurred.
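
The failure described in these two comments can be reproduced with a toy lease server. `ToyLeaseServer` is a made-up in-memory model with an injectable clock, not Lockable's implementation; it only shows how client A can keep believing it holds a lock that the server has already handed to B.

```python
class ToyLeaseServer:
    """In-memory lease table: lock name -> (owner, expiry time)."""

    def __init__(self, ttl, clock):
        self._ttl = ttl
        self._clock = clock
        self._leases = {}

    def acquire(self, name, owner):
        entry = self._leases.get(name)
        if entry is not None and entry[1] > self._clock():
            return False  # someone holds an unexpired lease
        self._leases[name] = (owner, self._clock() + self._ttl)
        return True

    def holder(self, name):
        entry = self._leases.get(name)
        if entry is None or entry[1] <= self._clock():
            return None
        return entry[0]


now = [0.0]
server = ToyLeaseServer(ttl=60.0, clock=lambda: now[0])

a_believes_it_holds = server.acquire("report", "A")  # A starts long-running work
now[0] = 90.0                                        # A stalls past the TTL
b_believes_it_holds = server.acquire("report", "B")  # server grants the lock to B

# A was never told anything, so both clients now act as the exclusive owner.
both_think_they_own_it = a_believes_it_holds and b_believes_it_holds
actual_holder = server.holder("report")
```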

This implementation has "heartbeats" so I wonder whether it solves the problem.

Heartbeats do not solve it. You need fencing tokens to reject writes if the lock has expired.

See this amazing article by Martin Kleppmann, author of Designing Data-Intensive Applications.

"How to do distributed locking"

https://martin.kleppmann.com/2016/02/08/how-to-do-distribute...
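
A minimal sketch of the fencing-token approach: the lock service hands out a monotonically increasing token with each grant, and the storage layer refuses any write carrying a token lower than one it has already seen. Both the counter and `FencedStorage` are hypothetical stand-ins, and the token values are arbitrary; the acquire response shown at the top of this thread does not include such a token.

```python
import itertools


class FencedStorage:
    """Hypothetical storage service that remembers the highest token seen."""

    def __init__(self):
        self._highest_token = 0
        self.data = {}

    def write(self, key, value, token):
        if token < self._highest_token:
            return False  # writer's lease was superseded by a newer grant
        self._highest_token = token
        self.data[key] = value
        return True


token_counter = itertools.count(start=33)  # the lock service's grant counter

token_a = next(token_counter)  # client A acquires the lock and gets token 33
token_b = next(token_counter)  # A's lease expires; client B gets token 34

storage = FencedStorage()
b_write_ok = storage.write("file", "B's data", token_b)
a_write_ok = storage.write("file", "A's late data", token_a)  # A wakes up late
```

The check lives in the storage service, which is why this cannot be bolted on purely from the lock service's side.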

It's really up to the client implementation.

In order to deal with long-running processes, the client Python implementation uses a separate thread for sending periodic heartbeats to the lockable server, which serves to do 2 things:

  1. renew the lease so it doesn't expire, which would release the lock
  2. notify the main worker thread in the event the lock has been lost

The GP's point was that the heartbeat thread can hang in pathological cases, which means the main worker thread would not be notified that it has lost the lock.
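
The two jobs of the heartbeat thread listed above can be sketched as follows. This is not the actual client code: `HeartbeatThread` and the `send_heartbeat` callable are hypothetical, and the example uses a fake heartbeat that fails on the third call so it runs without a network.

```python
import threading


class HeartbeatThread(threading.Thread):
    """Calls `send_heartbeat` periodically; sets `lost` if renewal fails."""

    def __init__(self, send_heartbeat, interval_seconds):
        super().__init__(daemon=True)
        self._send_heartbeat = send_heartbeat
        self._interval = interval_seconds
        self._stop_requested = threading.Event()
        self.lost = threading.Event()

    def run(self):
        # Event.wait returns False on timeout, i.e. when it is time to beat again.
        while not self._stop_requested.wait(self._interval):
            try:
                renewed = self._send_heartbeat()
            except Exception:
                renewed = False  # treat network errors as a lost lease
            if not renewed:
                self.lost.set()
                return

    def stop(self):
        self._stop_requested.set()


# Fake heartbeat that succeeds twice, then reports the lock as gone.
replies = iter([True, True, False])
beats = HeartbeatThread(lambda: next(replies), interval_seconds=0.01)
beats.start()
lock_was_lost = beats.lost.wait(timeout=2.0)
beats.stop()
```

The main worker would have to poll `beats.lost.is_set()` between steps, and if this thread itself stalls the flag is never set, which is exactly the gap the GP describes.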

This can be addressed in a few ways - one way being by adding fencing tokens[0]. However, that requires modifying the underlying resource you are accessing.

[0]: https://ebrary.net/64834/computer_science/fencing_tokens

What's the advantage of heartbeats over a simpler implementation via SETNX in Redis if you still need fencing tokens?
> This is a misleading and dangerous service.

How is it dangerous? What danger would the user be in?

Building software that relies on an incorrect assumption sounds dangerous to me.

There's plenty of critical software in the world, some of which has real world consequences if there are bugs.

The user could create software that thinks a resource is locked under their exclusive control, but is not. This can lead to data corruption, which could even lead to errors propagating throughout a system.

This is basically the SETNX Redis command, as a service. Redis servers are already quite the cheap commodity. I think the success of this service really all comes down to performance, uptime, and scalability. There's a lot of overhead with HTTP(S) which will certainly hurt perf.

https://redis.io/commands/setnx/

Does HTTP/3 and/or gRPC help with that?

How do you ensure that a lock isn't acquired by two requests? Do you use atomic compare and set?

How do you release a lock reliably? How do you solve the problem of releasing accidentally while using a resource?

Can the lock jam locked if the process dies?

I would use Consul for this or I would try to avoid needing to lock to begin with.

Even better is to use a language such as Bloom.

Good questions:

> How do you ensure that a lock isn't acquired by two requests?

Indeed, the compare and set is done atomically. It's guaranteed that the lock can only be acquired by at most one process.

> How do you release a lock reliably? How do you solve the problem of releasing accidentally while using a resource?

A lock is released in one of two conditions:

  1. the /release endpoint is called
  2. the lease on the lock expires

I'm not sure what you mean by "releasing accidentally" - if nobody calls /release then the lock won't be released.

> Can the lock jam locked if the process dies?

Locks come with a lease which expires after a set amount of time. If lockable doesn't receive a heartbeat to renew the lease, the lock is released automatically.

> I would use Consul for this or I would try avoid needing to lock to begin with.

For sure Consul is an alternative, so are ZooKeeper and things like etcd - lockable is intended to be a no-setup alternative to something like that.

I'm thinking of a case where a process thinks it has a lock and tries to do things even though it doesn't have the lock.

This happened on a project I was on, two processes would join a RabbitMQ on a service that was not safe to load balance.

I suspect it could happen if the program using your lock service is not implemented properly and the lease expires but the program doesn't realise.

Ultimately, you have to trust your clients to do the right thing.

The lockable server guarantees that e.g. if multiple /acquire requests come in for the same lock in a short time span, only one request will be successful. You are correct in that, without care, there can be pathological cases where a client may not realize their lease has expired.
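
The server-side guarantee described here (many concurrent /acquire requests, exactly one winner) amounts to an atomic test-and-set. A minimal sketch, assuming a hypothetical in-process `AtomicLockTable` rather than whatever Lockable actually runs on:

```python
import threading


class AtomicLockTable:
    """Hypothetical server-side table where test-and-set happens under a mutex."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._owners = {}

    def acquire(self, name, owner):
        with self._mutex:  # check and set are one indivisible step
            if name in self._owners:
                return False
            self._owners[name] = owner
            return True


table = AtomicLockTable()
results = []
results_mutex = threading.Lock()


def worker(owner):
    won = table.acquire("my-lock-name", owner)
    with results_mutex:
        results.append(won)


threads = [threading.Thread(target=worker, args=(f"client-{i}",)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

winners = results.count(True)
```

This covers only the grant itself; it says nothing about a client that keeps working after its lease has expired.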

etcd and ZooKeeper locks are based on sessions.

You open a session, and the session can hold many locks.

If you can’t talk to the server, the session expires and all its locks are released by the server, but the client library also notifies you that the session has expired.

Also, etcd and Consul use the Raft consensus algorithm, and ZooKeeper uses the similar Zab protocol. It’s literally impossible to make a lock server that is safe without using a consensus algorithm for replication.

Have you considered a version that holds a lock as long as it maintains an HTTP connection (with periodic data sent to keep the connection alive), and when the connection closes, the lock is released? That would help prevent the lock from staying held.

Conceptually, that's how it works; however, instead of keeping an HTTP connection open, lockable expects to receive periodic heartbeats (via /heartbeat) to keep the lock acquired. The TTL is variable, and can be set depending on the use case.

This can be a useful service, but it needs durability to avoid race conditions.

If the service goes down and restarts, it will lose all locks currently held. At this point, a new client could obtain the lock, while an older client that is still blissfully unaware of the service going down may proceed assuming that its lease is still valid.

If it is implemented as a single server, then it must persist the lock info to disk before granting a lock. Or one can get durability using replication. In which case I would just as soon use zookeeper.
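
The "persist before granting" rule for a single server could look roughly like this. `DurableLockTable` and its JSON file format are invented for the example; a production version would also need to fsync the containing directory and handle lease expiry, both omitted here.

```python
import json
import os
import tempfile


class DurableLockTable:
    """Hypothetical single-server lock table that persists before granting."""

    def __init__(self, path):
        self._path = path
        self._owners = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self._owners = json.load(f)

    def _persist(self, owners):
        # Write to a temp file, fsync it, then atomically swap it into place.
        tmp_path = self._path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(owners, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self._path)

    def acquire(self, name, owner):
        if name in self._owners:
            return False
        updated = {**self._owners, name: owner}
        self._persist(updated)  # only after this succeeds is the lock granted
        self._owners = updated
        return True


state_file = os.path.join(tempfile.mkdtemp(), "locks.json")
first_boot = DurableLockTable(state_file)
granted_before_crash = first_boot.acquire("job", "old-client")

after_restart = DurableLockTable(state_file)  # simulates a service restart
granted_after_restart = after_restart.acquire("job", "new-client")
```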

This is misleading and a bit dangerous...

You'd also need to "sync" the lock resource, which is accessed over REST. At the very least, you need some sort of Idempotency-Key for your REST API.

(Imagine the scenario where /acquire succeeds on the server, but the network response fails before the client gets to read it. The client retries. How does the server know it's the same request?)
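
One way to picture the idempotency requirement: the client generates a key once and reuses it on every retry, and the server treats a repeated key as the same request. The `IdempotentLockTable` below is a hypothetical illustration of that idea, not a description of Lockable's API.

```python
import uuid


class IdempotentLockTable:
    """Hypothetical server that treats a repeated request key as the same request."""

    def __init__(self):
        self._holders = {}  # lock name -> request key that acquired it

    def acquire(self, name, request_key):
        current = self._holders.get(name)
        if current is None:
            self._holders[name] = request_key
            return True
        # A retry carrying the same key gets the same answer as the original.
        return current == request_key


table = IdempotentLockTable()
my_key = str(uuid.uuid4())  # generated once by the client, reused on retry

first_try = table.acquire("my-lock-name", my_key)  # reply is lost in transit
retry = table.acquire("my-lock-name", my_key)      # client retries
other_client = table.acquire("my-lock-name", str(uuid.uuid4()))
```

Without the key, the retry would look like a competing request and the client would wrongly conclude it had failed to get the lock.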

As these locks automatically expire, it seems scary to me that losing network connectivity for >TTL can break mutual exclusion guarantees.
Unfortunately, you do need _some_ sort of TTL because there is no way for the lockable server to know when a remote process has simply died.

However, TTLs can be set to be arbitrarily long (e.g. multiple days), which means, in practice, you can avoid losing locks due to networking issues.

You can instead have manual overrides. Not seeing a lock then means that a person has decided it's safe to run, rather than that something simply took too long to respond.

If TTLs can be arbitrary lengths, wouldn't setting them to high values (a week, month, year, whatever) allow you to implement whatever manual override mechanism you wanted?

So it’s just a lease server, but without the strong linearizability guarantee you would get with etcd and ZooKeeper?

Lockable itself isn't distributed (or, at the very least, doesn't appear distributed from the outside), so I'm not sure if linearizability applies here, but maybe I'm misunderstanding your comment.

In a similar vein, I guess you can ask what happens if a client first checks if a lock is available then tries to acquire it, in two separate steps; but in that case, there's no guarantee that the lock wasn't acquired in between checking and acquisition.

Basically, if it does not use synchronous replication and the master server switches, then 2 clients could think they own an unexpired lock.

But if it does synchronous replication without using a consensus algorithm like Paxos or Raft, then the system becomes unwritable if an instance goes down.

I recall using https://github.com/awslabs/amazon-dynamodb-lock-client previously, this is basically a managed version. Interesting.
Indeed, the first version of this heavily borrowed from this approach.

The trickiest part is correctly writing the logic that spawns the secondary thread used for heartbeats.

What's the difference between this and e.g. using a Postgres db?
Probably ease of setting up and not needing to manage a piece of infrastructure - which is the case with all service offerings, really.

It's not a difficult conceptual task to keep track of some locks in a Postgres DB (or use PG advisory locks), but you still need to:

  * make sure all processes can access the db (directly or indirectly)
  * make sure the db can handle all the connections (or set up PGBouncer if you think you're going to be handling many processes at the same time)
  * write some client-side logic to acquire locks, retry on failure etc.
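
The "keep track of some locks in a DB" approach can be sketched with a table whose primary key is the lock name. The example uses SQLite from the standard library purely as a stand-in so it runs anywhere; the table layout and function names are assumptions, and on Postgres the same pattern would go through a Postgres driver (or advisory locks) instead.

```python
import sqlite3


def try_acquire(conn, name, owner):
    """Insert a row for the lock; the primary key makes a second insert fail."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO locks (name, owner) VALUES (?, ?)", (name, owner)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def release(conn, name, owner):
    """Delete the row, but only if we are the recorded owner."""
    with conn:
        cur = conn.execute(
            "DELETE FROM locks WHERE name = ? AND owner = ?", (name, owner)
        )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locks (name TEXT PRIMARY KEY, owner TEXT NOT NULL)")

first = try_acquire(conn, "nightly-job", "worker-1")
second = try_acquire(conn, "nightly-job", "worker-2")
released = release(conn, "nightly-job", "worker-1")
third = try_acquire(conn, "nightly-job", "worker-2")
```

Retry and expiry logic, the third bullet above, would still have to be written on top of this.
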
They are both equally unsafe to use as storage for distributed locks.

Unless the only thing you are trying to protect with your lock is access to other rows in the same PostgreSQL server.

If you’re using distributed locks/leases you probably messed up somewhere. I think there’s no way you write correct code using this.
Random thought - doesn't S3 have a leasing/locking system that could be used for distributed locks?
Unfortunately, it doesn't. That was actually the reason I built this originally - there's no easy way to control read/write access to a file on S3 without using some sort of external locking system.
You can add locks to objects, which restricts who can delete them and how, but requires bucket wide settings.
BTW, this is true, but that is not locking in the "sync locking" sense.
It's not the intended purpose but they can be used like this. Lock files can be written, will fail if they already exist - that gets you a distributed equivalent to flock doesn't it?
Have you looked at Azure Blob Storage's lock feature? Could that be used?
Did Zookeeper or Etcd not suit your needs?
They are valid solutions for this use-case; the main drawback is you need to set up and maintain a ZooKeeper cluster. Lockable is intended as a simpler and faster-to-use alternative.
AWS recommends using DynamoDB and has a Java reference implementation. AFAIK GCP doesn't support locks either. Azure supports locks on Blob Storage (their object store service).
Ah, someone mentioned it in reference to Azure Blob Storage recently, and I had (apparently incorrectly) assumed it was also an S3 feature.

Any reason Azure Blob Storage couldn't be used as a no-setup, managed, cheap way to acquire distributed locks?

Azure Blob Storage is literally used as a cheap way to acquire distributed locks today by the Azure SDK.

If you look at the Azure Event Hubs library, it uses blobs as locks to partition the consumer instances at runtime.

What kind of performance can one expect?
Unlike locks, can a second instance call the heartbeat api without acquiring the lock?
Yes, any process can call a heartbeat. If the lock hadn't been acquired (by any process), the heartbeat call fails.