| HN Mirror



	by franciskim 3457 days ago
	Yeah, I was about to mention multi-AZ load balancing too. Can't believe a real time messaging platform doesn't have that.


matt_oriordan 3457 days ago

Well, it's not that simple. We do run in multiple availability zones in every region. But if the connectivity between them is only partially working, which it was, and shared services from Amazon itself aren't working fully from every instance, you have a huge mess to contend with where cluster consensus cannot be formed. So in cases like this we did what we should have done and routed traffic away from a network that was unreliable and partly partitioned. The point for us was not that this availability zone went down at all. It was that Amazon claimed throughout that everything was operating normally for hours when this was very far from the truth.


lostcolony 3457 days ago

So one availability zone went down, not the region, which Amazon indicated on their status page (which is clearly set up to predominantly display outages of the region, not of a single AZ within the region), and because of how your system was set up to be dependent on each AZ being operational and networks to not partition, it caused issues.

I get that having issues on AWS is irritating; I exist in that ecosystem too. But...I really can't fault them for this, or claim that they're lying. AWS says not to rely on any one AZ being up and/or reachable, and yet you did. And the fact that it caused problems for you means you want them to say the entire region is down. Why? They make regions fault tolerant by having multiple AZs; they guarantee reliability at the region level, not at the AZ level, and that's what the status page is intended to track.

Now, I can see wanting a clear status page per AZ, rather than just a blue 'i'. That's a valid request. But -request- that, don't claim that they're lying. You're being antagonistic despite them doing everything they've promised, and their status page being correct (just not using the colors you would like because they view severity differently than you).


apeace 3457 days ago

I think the two suggestions I made are more reasonable than claiming AWS "lied". It's understandable that your customer would be confused seeing the blue "i" instead of something yellow. But that doesn't mean they claimed "everything was operating normally".

In terms of your cluster, it takes a lot of testing and tweaking to ensure your cluster can reach consensus during partial/sporadic partitions. But it can and should be done if you need high availability (e.g. take nodes out of the cluster if they keep disconnecting, until they have a stable connection for X minutes).
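To make the parenthetical concrete, here is a rough sketch of that kind of flap damping, not anyone's actual implementation. The class name `FlapDamper`, the thresholds (3 disconnects in 60 seconds, 300 seconds of stability to rejoin), and the explicit `now` timestamps are all assumptions chosen for illustration; a real cluster would hook this into its own membership and failure-detection layer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeState:
    connected: bool = False
    stable_since: Optional[float] = None  # start of the current unbroken connection
    disconnects: List[float] = field(default_factory=list)  # recent disconnect times
    quarantined: bool = False


class FlapDamper:
    """Hypothetical helper: exclude nodes that keep disconnecting until they
    have held a stable connection for `stable_s` seconds."""

    def __init__(self, max_disconnects: int = 3, window_s: float = 60.0,
                 stable_s: float = 300.0) -> None:
        self.max_disconnects = max_disconnects
        self.window_s = window_s
        self.stable_s = stable_s
        self.nodes: Dict[str, NodeState] = {}

    def on_connect(self, node: str, now: float) -> None:
        st = self.nodes.setdefault(node, NodeState())
        st.connected = True
        st.stable_since = now

    def on_disconnect(self, node: str, now: float) -> None:
        st = self.nodes.setdefault(node, NodeState())
        st.connected = False
        st.stable_since = None
        # Keep only disconnects inside the sliding window, then record this one.
        st.disconnects = [t for t in st.disconnects if now - t < self.window_s]
        st.disconnects.append(now)
        if len(st.disconnects) >= self.max_disconnects:
            st.quarantined = True  # too flaky: take it out of the cluster

    def is_member(self, node: str, now: float) -> bool:
        st = self.nodes.get(node)
        if st is None or not st.connected or st.stable_since is None:
            return False
        if st.quarantined:
            if now - st.stable_since < self.stable_s:
                return False  # connected, but not stable for long enough yet
            # Stable for the full period: readmit and forget the old flaps.
            st.quarantined = False
            st.disconnects.clear()
        return True
```

Consensus would then only count nodes for which `is_member` returns true, so a node on a half-partitioned link stops disturbing quorum until its connection has settled.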
