by kgeist
1510 days ago
At our job, someone decided to use a ready-made Go library that used Redis for distributed locking. I found that it was broken by design and completely unreliable, and we had random transient errors stemming from it. It worked fine 99.9% of the time, but once in a while we ended up with inconsistent state in our application.

The description initially made sense and the usage looked simple. A node acquired the lock by creating a value with a TTL, so the lock would auto-expire if the node crashed. If a node found that a value under the same name already existed in Redis, it would block. Since Redis processes commands serially, each of these operations was effectively atomic.

The problem was the auto-expire feature. The TTL can expire while your code under the lock is paused for GC or waiting on I/O, so the lock you thought you held could be released at basically any point while you were supposedly inside the critical section. Extending the lock's TTL after every line of code isn't practical and is probably prone to race conditions anyway (and IIRC the library didn't provide a way to do it). I read about a technique called fencing tokens, but it requires changes to the way your shared resources are accessed, which isn't always possible. I still don't know how to do distributed locks right, and there seem to be many broken implementations in the wild.
|
The idea of fate sharing is very general and useful: you can, for example, introduce reconnectable sessions and attach shared state to them, which gives you transport independence and the ability to recover from transport failure.
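A rough sketch of that session idea, under stated assumptions: the names `Session`, `Server`, `Attach`, `Detach`, `Transport`, and `fakeConn` are all hypothetical, and `fakeConn` merely records messages instead of using a real socket. The point it illustrates is that the state (here a counter) shares fate with the session, not with any particular connection, so a dropped connection can be replaced without losing it.

```go
package main

import (
	"fmt"
	"sync"
)

// Transport is a hypothetical stand-in for a network connection.
type Transport interface {
	Send(msg string) error
}

// fakeConn records messages instead of writing to a socket.
type fakeConn struct {
	sent []string
}

func (c *fakeConn) Send(msg string) error {
	c.sent = append(c.sent, msg)
	return nil
}

// Session owns the shared state; transports come and go.
type Session struct {
	ID      string
	Counter int
	conn    Transport
}

type Server struct {
	mu       sync.Mutex
	sessions map[string]*Session
}

func NewServer() *Server { return &Server{sessions: map[string]*Session{}} }

// Attach binds a transport to an existing session, creating it if needed.
func (s *Server) Attach(id string, t Transport) *Session {
	s.mu.Lock()
	defer s.mu.Unlock()
	sess, ok := s.sessions[id]
	if !ok {
		sess = &Session{ID: id}
		s.sessions[id] = sess
	}
	sess.conn = t
	return sess
}

// Detach models a transport failure: the connection goes, the state stays.
func (s *Server) Detach(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if sess, ok := s.sessions[id]; ok {
		sess.conn = nil
	}
}

// Increment updates session state and notifies the current transport, if any.
func (s *Server) Increment(id string) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	sess, ok := s.sessions[id]
	if !ok {
		return 0, fmt.Errorf("unknown session %q", id)
	}
	sess.Counter++
	if sess.conn != nil {
		_ = sess.conn.Send(fmt.Sprintf("counter=%d", sess.Counter))
	}
	return sess.Counter, nil
}

func main() {
	srv := NewServer()
	c1 := &fakeConn{}
	srv.Attach("s42", c1)
	srv.Increment("s42")
	srv.Increment("s42")
	srv.Detach("s42") // the first connection drops

	c2 := &fakeConn{}
	srv.Attach("s42", c2) // client reconnects with the same session ID
	n, _ := srv.Increment("s42")
	fmt.Println("counter after reconnect:", n) // 3
	fmt.Println("sent on c1:", c1.sent, "sent on c2:", c2.sent)
}
```

A real system would also need to expire sessions whose clients never come back, and to authenticate the session ID on reconnect; both are omitted here to keep the sketch short.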
[1] Clark, David D. “The Design Philosophy of the DARPA Internet Protocols.” ACM SIGCOMM Computer Communication Review 18, no. 4 (August 1988): 106–14. https://doi.org/10.1145/52325.52336.