Hacker News
by SEMW 3410 days ago
Simple thought experiment: realtime messaging system; servers in multiple AZs; user A is connected to endpoint 1 in AZ 1, user B is connected to endpoint 2 in AZ 2; A publishes a message on a channel that B is subscribed to. Then it is an error for B to not receive it. This makes network partitions (especially partial, asymmetric, or sporadic ones) a nontrivial problem. Of course there are solutions, but it's hardly as simple as "just make your app multi-AZ". Not every app is a bunch of independent boxes serving web pages.
3 comments

Well, what is your SLO around message delivery? If it's "a successful response means a message will be delivered at least once and in under X time", then you need to verify that the message has been durably committed to multiple machines in all of your availability zones. If it's just that the message is durably committed, or the SLO on delivery is long enough, then you can drop the multi-AZ bit.
I think if your app is a cluster that requires a quorum, 2 AZs just isn't enough: you need to be running in 3 to tolerate one going down.
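To make the arithmetic behind that concrete, here is a minimal sketch. It assumes a plain majority quorum (more than half of all nodes) and that an AZ outage takes down every node in that AZ; the function names are made up for illustration.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest strict majority of the cluster."""
    return total_nodes // 2 + 1


def survives_any_single_az_loss(nodes_per_az: list[int]) -> bool:
    """True if a majority of nodes remains after losing any one AZ.

    Assumption: losing an AZ removes all of the nodes placed in it.
    """
    total = sum(nodes_per_az)
    needed = quorum_size(total)
    return all(total - lost >= needed for lost in nodes_per_az)


two_az = survives_any_single_az_loss([2, 2])       # 4 nodes, quorum 3, 2 left
three_az = survives_any_single_az_loss([1, 1, 1])  # 3 nodes, quorum 2, 2 left
```

With two AZs, whichever AZ holds half or more of the nodes takes the majority down with it, no matter how the nodes are split. The same applies with three AZs if the placement is lopsided, e.g. `[3, 1, 1]`.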
Yep. Non-quorum members can even simply terminate existing connections and refuse new connections from clients, so clients are always either connected to a quorum node or not connected at all (CP, no A).
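A rough sketch of that behaviour, under the assumption that each node knows the cluster size and gets told how many peers it can currently reach. The `Node` class and its methods are hypothetical, not any particular system's API, and real membership detection is much harder than this.

```python
class Node:
    """Toy endpoint that only serves clients while it can see a majority."""

    def __init__(self, cluster_size: int) -> None:
        self.cluster_size = cluster_size
        self.reachable_peers = 0  # peers other than this node
        self.clients: set[str] = set()

    def in_quorum(self) -> bool:
        # Count this node itself plus the peers it can reach.
        return self.reachable_peers + 1 >= self.cluster_size // 2 + 1

    def update_reachability(self, reachable_peers: int) -> int:
        """Record the new peer count; drop all clients if quorum is lost.

        Returns the number of client connections that were terminated.
        """
        self.reachable_peers = reachable_peers
        if self.in_quorum():
            return 0
        dropped = len(self.clients)
        self.clients.clear()
        return dropped

    def connect(self, client_id: str) -> bool:
        """Accept a client only while this node is on the majority side."""
        if not self.in_quorum():
            return False
        self.clients.add(client_id)
        return True


node = Node(cluster_size=3)
node.update_reachability(1)          # sees one peer: 2 of 3, in quorum
accepted = node.connect("user-b")
dropped = node.update_reachability(0)  # partitioned off: 1 of 3
refused = not node.connect("user-c")
```

Clients that get dropped or refused are expected to reconnect through an endpoint that is still on the majority side.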
The point is, if a health check fails or if there is an outage in AZ 2, then the ELB should be scripted to route to AZ 1 only (and to AZ 3, if it exists).