| HN Mirror


by mikebo 5144 days ago
Worst part of this outage: paying for a Multi-AZ RDS instance and having failover totally, completely, fail.

5 comments

keithnoizu 5144 days ago

I'm paying like $2,300 a month and even something basic like failover isn't working. I'm not happy.

shiftpgdn 5144 days ago

At $2300/month you could redundantly colo or lease VERY powerful servers in 3-4 data centers around the country.

dialtone 5144 days ago

Except when you have to factor in all the plane flights to replace a broken HDD, and the risk of not making it in time when one breaks.

shiftpgdn 5144 days ago

Most colo facilities let you buy hands-on time from their techs, or include a small amount per month for things like hard drive/RAM swaps.

rdl 5144 days ago

Yeah, I don't think I'd go with less than RAID-6 (or full system redundancy plus 1 drive redundancy in each). Rebuilds just take too long, even with an in-chassis spare on RAID-5.

Unfortunately Areca is really the only controller I've found which is well supported and does RAID-6 fast.
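
To put rough numbers on the rebuild window, here is a minimal sketch. The 2 TB drive size and the 50 MB/s sustained rebuild rate are made-up assumptions for illustration, not measurements from any controller, and the function names are hypothetical. The only hard facts used are that RAID-5 tolerates one failed drive and RAID-6 tolerates two.

```python
# Drive failures each RAID level survives without losing the array.
FAILURES_TOLERATED = {"RAID-5": 1, "RAID-6": 2}


def rebuild_hours(drive_tb: float, rebuild_mb_per_s: float) -> float:
    """Hours needed to rewrite one whole replacement drive at a sustained rate."""
    drive_mb = drive_tb * 1_000_000  # decimal terabytes to megabytes
    return drive_mb / rebuild_mb_per_s / 3600


def spare_failures_during_rebuild(level: str) -> int:
    """Further drive failures the array survives while one drive is already rebuilding."""
    return FAILURES_TOLERATED[level] - 1


# Assumed figures: one 2 TB drive rebuilt at a sustained 50 MB/s.
print(f"{rebuild_hours(2.0, 50.0):.1f} hours")   # 11.1 hours
print(spare_failures_during_rebuild("RAID-5"))   # 0
print(spare_failures_during_rebuild("RAID-6"))   # 1
```

Under those assumptions a RAID-5 array spends about eleven hours with no margin at all, while RAID-6 can still lose one more drive during the same window.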

bgentry 5144 days ago

Would those be managed at that price? Because it's a hell of a lot more expensive when you factor in the cost of devops to make sure it stays working and fails over properly.

keithnoizu 5143 days ago

Poor inherited architecture, working to scale out greatnonprofits.org horizontally but it will be a while before we get there.

I have nothing against colo, but I don't really have time to run around the country checking on servers.

nadahalli 5144 days ago

I feel for you :-(

Amazon is not cheap, and they have failed way too many times in recent memory.

But the api, oh the api - it's crack, and I can't live without it.

RegEx 5144 days ago

I know what you mean. I have a lot of issues with AWS, but the AWS console is exactly what my manager needs so he can do things himself. Even simple things such as AWS load balancing fail when we get any decent amount of traffic.

werkshy 5144 days ago

Luckily my RDS wasn't affected, but ELB merrily sent traffic to the affected zone for 30 minutes. (Either that or part of the ELB was in the affected zone and was not removed from rotation.)

We pay a lot to stay multi-AZ, and it seems Amazon keeps finding ways to show us their single points of failure.
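
For reference, the behavior one would expect from a load balancer here is to drop a backend from rotation after a few failed health checks. A minimal sketch of that idea follows. It is a generic toy model, not ELB's actual algorithm, and every name in it (`Backend`, `Balancer`, `unhealthy_threshold`, the zone labels) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    zone: str
    consecutive_failures: int = 0
    in_rotation: bool = True


@dataclass
class Balancer:
    backends: list
    unhealthy_threshold: int = 2  # assumed: failed checks before removal
    _next: int = 0

    def record_check(self, backend: Backend, healthy: bool) -> None:
        """Update a backend's state from one health-check result."""
        if healthy:
            backend.consecutive_failures = 0
            backend.in_rotation = True
            return
        backend.consecutive_failures += 1
        if backend.consecutive_failures >= self.unhealthy_threshold:
            backend.in_rotation = False

    def pick(self) -> Backend:
        """Round-robin over the backends that are still in rotation."""
        live = [b for b in self.backends if b.in_rotation]
        if not live:
            raise RuntimeError("no healthy backends")
        backend = live[self._next % len(live)]
        self._next += 1
        return backend


lb = Balancer([Backend("web-a", "zone-a"), Backend("web-b", "zone-b")])
bad = lb.backends[0]
lb.record_check(bad, healthy=False)
lb.record_check(bad, healthy=False)  # second failure reaches the threshold
print({lb.pick().zone for _ in range(4)})  # {'zone-b'}
```

If the component running the checks sits in the failed zone itself, nothing ever calls `record_check`, which matches the second possibility described above.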

gouranga 5144 days ago

That sucks badly.

Similar thing happened to me a while ago with a vendor. When your management team summons you to ask why the hell their site is down, you can't point fingers at the vendor if their marketing literature says it doesn't go down.

Sticky situation.

TazeTSchnitzel 5144 days ago

Can't you tell management that it isn't as reliable as they claim?

gouranga 5144 days ago

I did. Unfortunately in the financial services industry, believing it means taking responsibility for it.

its_so_on 5144 days ago

If you don't host your data in several alternative dimensions so that the same events wouldn't transpire in all of them - why not assume you'll encounter the occasional outage?

gouranga 5143 days ago

If only people understood that fact. Unfortunately few do.

malachismith 5144 days ago

Do we all agree that we are completely over AWS-EAST now? It's NOT worth the cost savings.

rabbitfang 5144 days ago

The Oregon (us-west-2) region is the same price as the Virginia (us-east-1) region.

res0nat0r 5144 days ago

Did/does your standby replica in another AZ have any instance notifications stating there is a failure? The outage report claims there were just EBS problems in only one AZ.

mikebo 5144 days ago

No, nothing unusual with our standby replica. It's not even clear if it was the standby or our primary that was in the affected AZ.

Multi-AZ RDS does synchronous replication to the standby instance -- I'm guessing something broke in there. Hopefully AWS will update with a postmortem as they usually do. Lots of frustrated Multi-AZ RDS customers on their forums.

res0nat0r 5144 days ago

Yeah, unfortunately it looks to be an EBS problem, and if the underlying EBS volume housing your primary DB instance takes a dump, then that is going to cause replication to fall over too.

mikebo 5144 days ago

Multi-AZ RDS deployment is supposed to protect you from that though. That's why it's 2x the price. We should have failed over to a different AZ w/o EBS issues.

res0nat0r 5144 days ago

If your source EBS volume is horked then you aren't going to be replicating any data to your backup host while the EBS volume is messed up (since your source data is unavailable). EBS volumes also don't cross/failover between AZ boundaries.

Maybe there was something bad with your replication server before the outage? It's hard to guess without knowing exactly what was happening at the time...

mikebo 5144 days ago

I don't think you're familiar with how Multi-AZ RDS works: http://aws.amazon.com/rds/faqs/#36

The whole point is to protect you from problems in one AZ by keeping a hot standby in another AZ. It doesn't matter whether it's due to EBS, power, etc. This is one of the primary reasons to use RDS instead of running MySQL yourself on an instance.
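
As a toy illustration of that argument (not how RDS is actually implemented), the sketch below models a primary and a standby in different zones with synchronous commits. Because every acknowledged write is already on the standby, losing the primary's storage should cost no data once the standby is promoted. All class names, zone labels, and the `storage_ok` flag are hypothetical.

```python
class Replica:
    """One database copy, backed by storage in a single zone."""

    def __init__(self, zone: str):
        self.zone = zone
        self.rows = []
        self.storage_ok = True

    def write(self, row: str) -> None:
        if not self.storage_ok:
            raise IOError(f"storage unavailable in {self.zone}")
        self.rows.append(row)


class MultiZoneDatabase:
    def __init__(self, primary: Replica, standby: Replica):
        self.primary = primary
        self.standby = standby

    def fail_over(self) -> None:
        """Promote the standby; the old primary becomes the (broken) standby."""
        self.primary, self.standby = self.standby, self.primary

    def commit(self, row: str) -> None:
        """Synchronous commit: return only once every reachable copy has the row."""
        if not self.primary.storage_ok:
            self.fail_over()
        self.primary.write(row)
        if self.standby.storage_ok:
            self.standby.write(row)
        # A standby that missed writes would need a resync before it could
        # be promoted again; that step is omitted from this sketch.


db = MultiZoneDatabase(Replica("zone-a"), Replica("zone-b"))
db.commit("order-1")
db.primary.storage_ok = False  # simulate a storage outage in zone-a
db.commit("order-2")
print(db.primary.zone, db.primary.rows)  # zone-b ['order-1', 'order-2']
```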
