
by everforward 654 days ago

> Convincing a whole generation of programmers that distributed lock are a feasible solution.

I too hate this. Not just because the edge cases exist, but also because of the related property: it makes the system very hard to reason about.

Questions that should be simple become complicated. What happens when the distributed locking system is down? What happens when we reboot all the nodes at once? What if they don't come down at exactly the same time and there's leader churn for like 2 minutes? Etc, etc.

Those questions should be fairly simple, but they turn into something where a senior dev has to trace codepaths and draw on a whiteboard to figure it out. It's not even enough to understand how a single node works in depth; they have to figure out how this node works and also how its state might affect another node's.

All of this is much simpler in leaderless systems (where the leader system is replaced with idempotency or a scheduler or something else).

I very strongly prefer avoiding leader systems; it's a method of last resort when literally nothing else will work. I would much rather scale a SQL database to support the queries for idempotency than deal with a leader system.

I've never seen an idempotent system switch to a leader system, but I've sure seen the reverse a few times.
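
For illustration, here is a minimal sketch of the "use a SQL database for idempotency" idea from the comment above. It is not taken from any system described in the thread: the table name `processed`, the helper names `init` and `process_once`, and the use of SQLite are all assumptions made for the example. A unique key stands in for the lock, so two workers that receive the same job race on an insert rather than on leadership.

```python
import sqlite3

def init(conn):
    # Hypothetical table: one row per idempotency key that has been handled.
    conn.execute("CREATE TABLE IF NOT EXISTS processed (key TEXT PRIMARY KEY)")
    conn.commit()

def process_once(conn, key, handler):
    """Run handler() only if `key` has not been recorded yet.

    Returns True if the handler ran, False if the key was already present.
    """
    try:
        # `with conn` commits on success and rolls back on any exception,
        # so a failing handler leaves no row behind and the job can be retried.
        with conn:
            conn.execute("INSERT INTO processed (key) VALUES (?)", (key,))
            handler()
        return True
    except sqlite3.IntegrityError:
        # The PRIMARY KEY constraint rejected a duplicate key.
        return False
```

Only the database row is transactional here. If `handler` has side effects outside the database (sending an email, calling another service), those are not undone by the rollback and would need their own idempotency handling.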


anothername12 654 days ago

>> Convincing a whole generation of programmers that distributed lock are a feasible solution.

> I too hate this. Not just because the edge cases exist, but also because of the related property: it makes the system very hard to reason about.

I think this is a huge problem with the way we’re developing software now. Distributed systems are extremely difficult for a lot of reasons, yet they’re often our first choice when developing even small systems!

At $COMPANY we have hundreds of lambdas, DocumentDB (btw, that is hell in case you’re considering it) and other cloud storage and queuing components. On-call shifts and bug hunts are basically quests to find some corner-case race condition, timing problem, bad read-after-write assumption, etc.

I’m ashamed to say, we have reads wrapped in retry loops everywhere.
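
For readers who haven't run into it, the pattern being confessed to looks roughly like the sketch below. The function name `read_with_retry`, its parameters, and the convention that a not-yet-visible record reads as `None` are all assumptions for illustration, not details from the comment.

```python
import time

def read_with_retry(read, attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call read() until it returns something other than None.

    Sleeps with exponential backoff between attempts and returns None
    if every attempt came back empty.
    """
    for attempt in range(attempts):
        value = read()
        if value is not None:
            return value
        if attempt < attempts - 1:
            # Back off: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

The loop hides a read-after-write gap rather than removing it: a write that takes longer to become visible than the total backoff still reads as missing.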

The whole thing could have been a Rails app built by a fraction of the team, with far better reliability, a system that is easier to reason about, and faster feature delivery.

You could say we’re doing it wrong, and you’d probably be partly right for sure, but I’ve done consulting for a decade at dozens of other places and it always seems like this.


everforward 654 days ago

> You could say we’re doing it wrong, and you’d probably be partly right for sure, but I’ve done consulting for a decade at dozens of other places and it always seems like this.

The older I get, the more I think this is a result of Conway's law and that a lot of this architectural cruft stems from designing systems around communication boundaries rather than things that make technical sense.

Monolithic apps like Rails only happen under a single team or teams that are so tightly coupled people wonder whether they should just merge.

Distributed apps are very loosely coupled, so it's what you would expect to get from two teams that are far apart on the org chart.

Anecdotally, it mirrors what I've seen in practice. Closely related teams trust each other and are willing to make a monolith under an assumption that their partner team won't make it a mess. Distantly related teams play games around ensuring that their portion is loosely coupled enough that it can have its own due dates, reliability, etc.

Queues are the king of distantly coupled systems. A team's part of a queue-based app can be declared "done" before the rest of it is even stood up. "We're dumping stuff into the queue, they just need to consume it" or the inverse "we're consuming, they just need to produce". Both sides of the queue are basically blind to each other. That's not to say that all queues are bad, but I have seen a fair few queues that existed basically just to create an ownership boundary.

I once saw an app that did bidirectional RPC over message queues because one team didn't believe the other could/would do retries, on an app that handled single digit QPS. It still boggles my mind that they thought it was easier to invent a paradigm to match responses to requests than it was to remind the other team to do retries, or write them a library with retries built in, or just participate in bleeping code reviews.
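
To give a sense of what "a paradigm to match responses to requests" involves, here is a minimal in-process sketch that uses Python's standard `queue.Queue` as a stand-in for a message broker. The names `serve` and `QueueRpcClient`, the message shape, and the correlation-ID scheme are assumptions for the example; the app described above may have worked quite differently.

```python
import queue
import threading
import uuid

def serve(requests, responses, handler):
    """Consume requests until a None sentinel, echoing each correlation ID back."""
    while (msg := requests.get()) is not None:
        responses.put({"correlation_id": msg["correlation_id"],
                       "result": handler(msg["payload"])})

class QueueRpcClient:
    """Single-threaded client; not safe to share between threads."""

    def __init__(self, requests, responses):
        self.requests = requests
        self.responses = responses
        self.unclaimed = {}  # replies seen so far that no call has picked up

    def call(self, payload, timeout=1.0):
        cid = str(uuid.uuid4())
        self.requests.put({"correlation_id": cid, "payload": payload})
        while cid not in self.unclaimed:
            # The timeout applies to each wait, not to the call as a whole;
            # queue.Empty is raised if nothing arrives in time.
            reply = self.responses.get(timeout=timeout)
            self.unclaimed[reply["correlation_id"]] = reply["result"]
        return self.unclaimed.pop(cid)
```

Even this toy version has to decide what to do with replies nobody is waiting for, how long to wait, and who cleans up. A plain request/response call with a retry wrapper avoids all of that.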


anothername12 654 days ago

> once saw an app that did bidirectional RPC over message queues

Haha I've seen this anti-pattern too (although I think it's in the enterprise patterns book??). It would bring production to a grinding halt every night. Another engineer and I stayed up all night and replaced it with a simple REST API.


icedchai 654 days ago

I once saw a REST API built with bidirectional queues. There was a “REST” server that converted HTTP to some weird custom format and an “app” server with “business logic”, with tons of queues in between. It was massively overcomplicated and never made it to production. I won’t even describe what the database looked like.


icedchai 654 days ago

I see the same. All this complexity to handle a few requests/second... but at least we can say it's "cloud native."


giovannibonetti 654 days ago

Same thing where I work now. Many experienced developers waste a huge chunk of their time trying to wrap their heads around the communication patterns and edge cases of their Django microservices. It's much more complex than an equivalent Rails monolith, even though Ruby and Rails both have their issues and could be replaced by more modern tech in 2024.
