HN Mirror



	by kgeist 1510 days ago
	At our job, someone decided to use an off-the-shelf Go library that used Redis for distributed locking. I found that it was broken by design and completely unreliable, and we had random transient errors stemming from it. It worked OK 99.9% of the time, but once in a while we got inconsistent state in our application. The description initially made sense and the usage looked simple. A node created a value with a TTL, so that the lock would auto-expire if the node crashed. If a node found that a value under the same name already existed in Redis, it would block. Since access to Redis is serialized, all such actions were basically atomic. The problem was the auto-expire feature. The TTL can expire while your code under the lock is scheduled out due to GC or is waiting for I/O, so the lock you held could be released at basically any point of execution while you were supposedly under it. Extending the lock's TTL after every line of code isn't practical and is probably prone to race conditions anyway (and the library, IIRC, didn't provide a way to do it). I read there's a technique called token fencing, but it requires additional changes to the way your shared resources are accessed, which isn't always possible. I still don't know how to do distributed locks right, and there seem to be many broken implementations in the wild.
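
To make the failure mode concrete, here is a rough sketch of a TTL lock expiring under a paused holder, and of how a fencing token would let the shared resource reject the late write. It is not the library in question. `FakeRedis`, `Storage`, and the manual clock are all made up for illustration, and the in-memory `acquire` only imitates what a `SET ... NX PX` call plus a counter would do against a real server.

```python
class FakeRedis:
    """In-memory stand-in for a single Redis server (hypothetical)."""

    def __init__(self):
        self.now = 0.0      # simulated clock, in seconds
        self.locks = {}     # lock name -> (owner, expiry time)
        self.counter = 0    # source of monotonically increasing tokens

    def acquire(self, name, owner, ttl):
        """Take the lock if it is free or expired; return a fencing token."""
        held = self.locks.get(name)
        if held is not None and held[1] > self.now:
            return None     # an unexpired lock exists, caller must wait
        self.locks[name] = (owner, self.now + ttl)
        self.counter += 1
        return self.counter


class Storage:
    """Hypothetical shared resource that checks the token on every write."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            return False    # a newer lock holder has already written
        self.highest_token = token
        self.value = value
        return True


redis = FakeRedis()
storage = Storage()

# Node A takes the lock with a 10 s TTL.
a_token = redis.acquire("lock", "A", ttl=10)

# Node A is paused (GC, slow I/O) for 15 s, so the TTL runs out.
redis.now += 15

# Node B acquires the same lock while A still believes it holds it.
b_token = redis.acquire("lock", "B", ttl=10)

b_write = storage.write(b_token, "from B")   # accepted
a_write = storage.write(a_token, "from A")   # A wakes up late, rejected
```

Both nodes get a token back, which is exactly the "two holders at once" situation. The only thing that protects the data here is `Storage.write` refusing the older token, and that is the change to the shared resource the comment says isn't always possible.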


tonyg 1510 days ago

So Redis isn't really a distributed locking system. The locks are all managed by a central, non-distributed server: Redis. But this kind of lock is useful too. One nice approach to handling crashes in a system like this is to use the idea of fate sharing [1]: you upper-bound the lifetime of a held lock by the lifetime of the TCP connection it was taken on. When the connection goes, the lock is auto-released. To support this, Redis would have to have some kind of EXPIRE_ON_DISCONNECT command - I don't know if it does or not.
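
A minimal sketch of the fate-sharing idea, assuming a central lock server that can see when a connection goes away. `LockServer`, its methods, and the connection ids are hypothetical. This is not a Redis API, and there is no real socket involved.

```python
class LockServer:
    """Hypothetical central lock server that ties locks to connections."""

    def __init__(self):
        self.owner = {}     # lock name -> id of the connection holding it

    def acquire(self, conn_id, name):
        if name in self.owner:
            return False    # already held on some live connection
        self.owner[name] = conn_id
        return True

    def disconnect(self, conn_id):
        """Called when a connection drops; frees every lock taken on it."""
        released = [n for n, c in self.owner.items() if c == conn_id]
        for name in released:
            del self.owner[name]
        return released


server = LockServer()
first = server.acquire("conn-1", "orders")     # conn-1 gets the lock
blocked = server.acquire("conn-2", "orders")   # conn-2 has to wait
released = server.disconnect("conn-1")         # conn-1 dies, lock is freed
retry = server.acquire("conn-2", "orders")     # conn-2 now succeeds
```

There is no TTL here, so a holder that is merely slow keeps its lock as long as its connection stays up. A client can still act on a lock after the server has seen the connection drop, so stale holders remain something to think about.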

The idea of fate sharing is very general and useful: you can, for example, introduce reconnectable sessions, and attach shared state to those, which gets you transport-independence and the ability to recover from transport failure.

[1] Clark, David D. “The Design Philosophy of the DARPA Internet Protocols.” ACM SIGCOMM Computer Communication Review 18, no. 4 (August 1988): 106–14. https://doi.org/10.1145/52325.52336.


junon 1510 days ago

Redis can be distributed FWIW, it has two clustering modes. It's just that sharded distribution comes with a bunch of caveats.


tonyg 1510 days ago

Yes, absolutely! The caveats are relevant to distributed locking, though: sharding would help scale out a locking system horizontally, but each subset of keyspace would still be a non-distributed locking service. Primary-secondary replication doesn't (as far as I can tell!) offer the necessary invariants to act as a locking service - at least, not when employing the straightforward technique GP mentions.
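
One way the invariant can break, sketched under the assumption that replication is asynchronous (the primary acknowledges a write before the replica has it). `Node` and the `pending` queue are made up for illustration and only model the ordering of events, not real Redis failover.

```python
class Node:
    """Hypothetical key-value node supporting set-if-absent."""

    def __init__(self):
        self.keys = {}

    def set_nx(self, key, value):
        if key in self.keys:
            return False
        self.keys[key] = value
        return True


primary, replica = Node(), Node()
pending = []    # writes acknowledged by the primary, not yet replicated

# Node A takes the lock on the primary; replication has not happened yet.
a_locked = primary.set_nx("lock", "A")
pending.append(("lock", "A"))

# The primary crashes before shipping `pending`; the replica is promoted.
pending.clear()
primary = replica

# The promoted node never saw A's lock, so B acquires it as well.
b_locked = primary.set_nx("lock", "B")
```

Both `a_locked` and `b_locked` come out true, so two nodes believe they hold the same lock after the failover.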


junon 1510 days ago

Yes, definitely. :)
