> Convincing a whole generation of programmers that distributed locks are a feasible solution.

I too hate this. Not just because the edge cases exist, but also because of the related property: it makes the system very hard to reason about. Questions that should be simple become complicated. What happens when the distributed locking system is down? What happens when we reboot all the nodes at once? What if they don't come down at exactly the same time and there's leader churn for like 2 minutes? Etc., etc.

Those questions should be fairly simple, but they become something where a senior dev has to trace code paths and draw on a whiteboard to figure it out. It's not even enough to understand how a single node works in depth; they also have to figure out how this node's state might impact another node's.

All of this is much simpler in leaderless systems (where the leader system is replaced with idempotency or a scheduler or something else). I very strongly prefer avoiding leader systems; they're a method of last resort for when literally nothing else will work. I would much rather scale a SQL database to support the queries for idempotency than deal with a leader system. I've never seen an idempotent system switch to a leader system, but I've sure seen the reverse a few times.
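To make the "SQL database for idempotency" idea concrete, here is a minimal sketch rather than anyone's production design. It assumes each unit of work has a stable ID, and it uses an in-memory SQLite database as a stand-in for whatever shared SQL database the nodes would really talk to. The table name `processed_jobs` and the helpers `make_store` and `run_once` are made up for illustration.

```python
import sqlite3
from typing import Callable


def make_store() -> sqlite3.Connection:
    # In-memory SQLite stands in for a shared SQL database (an assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE processed_jobs ("
        "job_id TEXT PRIMARY KEY, result TEXT NOT NULL)"
    )
    return conn


def run_once(conn: sqlite3.Connection, job_id: str, work: Callable[[], str]) -> bool:
    """Run work() at most once per job_id. Returns True if this call did the work."""
    try:
        with conn:  # commits on success, rolls back if anything raises
            # Claim the ID first. The PRIMARY KEY constraint rejects a second
            # claim, so no node has to be elected "the one allowed to run this".
            conn.execute(
                "INSERT INTO processed_jobs (job_id, result) VALUES (?, ?)",
                (job_id, "pending"),
            )
            result = work()
            conn.execute(
                "UPDATE processed_jobs SET result = ? WHERE job_id = ?",
                (result, job_id),
            )
        return True
    except sqlite3.IntegrityError:
        # Someone already claimed this job_id; skip it.
        return False
```

Calling `run_once(conn, "job-1", work)` twice runs `work` only the first time; the second call returns `False`. One caveat of this sketch: if `work()` has side effects outside the database, the rollback cannot undo them, so those side effects need to be safe to repeat as well.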
> I too hate this. Not just because the edge cases exist, but also because of the related property: it makes the system very hard to reason about.
I think this is a huge problem with the way we’re developing software now. Distributed systems are extremely difficult for a lot of reasons, yet they’re often our first choice when developing even small systems!
At $COMPANY we have hundreds of lambdas, DocumentDB (btw, that is hell in case you’re considering it) and other cloud storage and queuing components. On-call shifts and bug hunts are basically quests to find some corner-case race condition or timing problem, a broken read-after-write assumption, etc.
I’m ashamed to say, we have reads wrapped in retry loops everywhere.
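For anyone who hasn't run into it, the pattern looks roughly like the sketch below. It is a toy illustration only: `LaggyStore` is a hypothetical stand-in for an eventually consistent store where a write only becomes visible after a few reads, and `read_with_retry` is a made-up helper, not any real client's API.

```python
import time
from typing import Callable, Dict, List, Optional


def read_with_retry(
    read: Callable[[], Optional[str]],
    attempts: int = 5,
    base_delay: float = 0.01,
) -> str:
    """Poll read() with exponential backoff until it returns a value."""
    for attempt in range(attempts):
        value = read()
        if value is not None:
            return value
        if attempt < attempts - 1:
            time.sleep(base_delay * 2 ** attempt)
    raise TimeoutError(f"value not visible after {attempts} attempts")


class LaggyStore:
    """Toy store: a write becomes visible only after `lag` reads have missed it."""

    def __init__(self, lag: int) -> None:
        self.lag = lag
        self.visible: Dict[str, str] = {}
        self.pending: Dict[str, List] = {}

    def put(self, key: str, value: str) -> None:
        self.pending[key] = [value, self.lag]

    def get(self, key: str) -> Optional[str]:
        if key in self.pending:
            value, remaining = self.pending[key]
            if remaining == 0:
                # Replication has "caught up": the write is now readable.
                self.visible[key] = value
                del self.pending[key]
            else:
                self.pending[key][1] = remaining - 1
        return self.visible.get(key)
```

With `LaggyStore(lag=2)`, the first two reads miss and the third succeeds, so the loop hides the inconsistency. With a lag longer than the retry budget it raises `TimeoutError` instead, which is exactly the kind of timing-dependent failure that is hard to reproduce later.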
The whole thing could have been a Rails app with a fraction of the team size, and it would have been far more reliable, easier to reason about, and faster at delivering features.
You could say we’re doing it wrong, and you’d probably be partly right, but I’ve done consulting for a decade at dozens of other places and it always seems to be like this.