Hacker News new | ask | show | jobs
by FooBarWidget 1129 days ago
I also wrote about distributed locking a while ago, particularly about implementing one on Google Cloud Storage. https://www.joyfulbikeshedding.com/blog/2021-05-19-robust-di...

This is useful because cloud storage is very cheap and serverless. It certainly beats running a Redis or PostgreSQL instance.

In my research and implementation I took care of the problems described in this article (as well as the problems I encountered in other distributed lock implementations).

Clients either freezing or outright crashing or disappearing is definitely a major problem. So timeouts are a must. But you can't just have a timeout because you don't know whether a client is arbitrarily delayed or whether it has disappeared. Furthermore, some workloads have an unknown completion time so you can't pick any sensible timeout.

I address this by encouraging clients to behave as follows:

1. Make the timeout relatively short, but regularly refresh your lease to let others know that you're still alive.

2. Your lease may at any time be taken over by another client that thinks you've disappeared. So you must regularly check whether your lease on the lock is still valid. This checking must be done way, way before your timeout expires, in order to account for the possibility that you can be arbitarily delayed at any moment (GC pauses, scheduling pauses, network delays, etc). I recommend checking in an interval at most 1/8 of the timeout. Smaller intervals are better at the expense of more overhead.

3. Attempt to split up your work into a long-running, discardable preparation phase and a short-running commit phase. Verify the validity of your lock lease right before committing (in addition to periodic verification).

This is of course still not 100% foolproof. There is still risk that two clients run at the same time, but this risk can be arbitrarily reduced by reducing the verification interval and by optimizing for doctrine 3. If your workload cannot tolerate any risk of two clients running at the same time, no matter how small the risk, then pick an infinitely timeout. But I think for most workloads, the benefits of not having to administer the lock (not having to manually break its lease because a client has crashed) outweight the risk.

My algorithm has a Ruby implementation, as well as a third-party Go implementation.

Ruby: https://github.com/FooBarWidget/distributed-lock-google-clou...

Go: https://github.com/FooBarWidget/distributed-lock-google-clou...

1 comments

It’s just not possible to implement a correct system using distributed locks with these types of semantics.
Every time this topic comes up, it seems people invariably divide into one of two groups. In the first group are those who trust in the odds and play the probability game. In the other, those who demand absolute guarantees. Neither group seems fully capable of understanding the other's standpoint.

Let me add my own perspective: Distributed locks are a fallacy. They can be beneficial in decreasing contention, under the assumption that "most of the time, only one actor will be active". However, by themselves, they offer no solid guarantees. The blog post addresses this point by introducing the concept of fencing tokens. These tokens have the potential to provide concrete guarantees, but they require the cooperation of downstream systems for enforcement, which isn't always possible.

I was really surprised to see Antirez argue for the probability approach.

I don't know why you say neither group understands each other when I literally said in my comment that my approach is probabillistic.

Also, how is configuring an infinite timeout (and checking whether the client is actually gone before manually breaking the lock lease) not an absolute guarantee?

I guess it’s an absolute guarantee, but you lose liveness unless there is some other party handling those cases where the client is truly gone
An infinite timeout works in theory, but it is impractical at scale.
Okay but then I really don't know what point you're trying to make. Either an infinite timeout, or non-absolute mutual exclusion guarantee: we have to pick either one. The right choice depends on the workload. There is no perfect solution that lets you have your cake and eat it too.
Redlocks don't use anything proabilistic, not in the sense that the lock may fail to guarantee what it promises at random.
While that may be true, it’s generally not possible to use red locks to build a correct system is mutual exclusion is a strict requirement
what would you define as a correct system?
Elaborate?