by StabbyCutyou 3477 days ago
So, specifically in the context of a distributed queue (and if we're talking about a microservice architecture plus queueing because of scale concerns, you really need the message queue itself to be distributed, imo), these things get a lot harder.

Just because TCP gives you reliable, ordered delivery on a single connection doesn't mean you're defended against every kind of issue here.

I'm mostly talking about bugs in the software layers that sit on top of TCP: the drivers, the consumers/publishers, race conditions, message batching, ACKs that never arrive, etc. There are a lot of little things that can go wrong.
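
To make the "not receiving ACKs" case concrete, here's a minimal sketch, assuming a toy in-memory broker with at-least-once delivery. `Broker`, `NaiveConsumer`, and `IdempotentConsumer` are names made up for illustration, not any real queue's API. The message is handled fine, but the ACK is lost, so the broker redelivers it:

```python
class Broker:
    """Toy in-memory broker with at-least-once delivery (hypothetical)."""

    def __init__(self, messages):
        # msg_id -> payload, for every message still awaiting an ACK
        self.pending = dict(messages)

    def deliver(self, consumer, drop_ack_for=()):
        # IDs whose first ACK is "lost" between consumer and broker
        dropped = set(drop_ack_for)
        while self.pending:
            for msg_id, payload in list(self.pending.items()):
                consumer.handle(msg_id, payload)
                if msg_id in dropped:
                    # The consumer did the work, but the broker never saw
                    # the ACK, so the message stays pending and is redelivered.
                    dropped.discard(msg_id)
                    continue
                del self.pending[msg_id]


class NaiveConsumer:
    """Applies every delivery, so a redelivery is counted twice."""

    def __init__(self):
        self.total = 0

    def handle(self, msg_id, payload):
        self.total += payload


class IdempotentConsumer:
    """Remembers message IDs it has already applied and skips repeats."""

    def __init__(self):
        self.total = 0
        self.seen = set()

    def handle(self, msg_id, payload):
        if msg_id in self.seen:
            return
        self.seen.add(msg_id)
        self.total += payload


def run(consumer_cls):
    broker = Broker({"a": 10, "b": 20, "c": 30})
    consumer = consumer_cls()
    broker.deliver(consumer, drop_ack_for=["b"])
    return consumer.total
```

With the ACK for "b" dropped once, `run(NaiveConsumer)` double-counts that message while `run(IdempotentConsumer)` doesn't. TCP did its job in both cases; the bug lives entirely in the layer above it.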

In terms of partitions, look up a series called "Jepsen" by a guy who goes by Aphyr. It goes over the CAP theorem as it applies to distributed datastores and queues. His examples and tests demonstrate the concepts behind partitioning better than I can explain in an HN comment :)

And yes, these types of failures are implicit everywhere, but every additional layer, every hop in the chain, and every interaction added to the request flow increases the surface area for problems. Especially once you're at high scale with hundreds of nodes, or you become NIC- or CPU-bound, etc.
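
A rough back-of-the-envelope sketch of the "every hop" point, assuming each hop fails independently with the same probability (real failures are often correlated, so treat this as illustrative only; the function name is made up):

```python
def end_to_end_success(per_hop_success: float, hops: int) -> float:
    """Chance a request survives every hop, assuming independent hops."""
    if not 0.0 <= per_hop_success <= 1.0:
        raise ValueError("per_hop_success must be between 0 and 1")
    if hops < 0:
        raise ValueError("hops must be non-negative")
    # Independent events multiply, so reliability decays with each added hop.
    return per_hop_success ** hops


if __name__ == "__main__":
    for hops in (1, 5, 10, 50):
        print(hops, round(end_to_end_success(0.999, hops), 4))
```

Even at "three nines" per hop, ten hops already gives you about a 1% end-to-end failure rate under this assumption, and it only gets worse as the chain grows.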

There is a lot to unpack here, and it's not as simple as it seems on the face of it.