You guys might have answered this in one of your AMAs/blog posts (or was it raldi who commented?), but what options can reddit resort to should this stuff happen again to this degree of severity?
We're moving away from the EBS product altogether. The hard part is dealing with the master databases. Normally I'd have a master database with built-in RAID 10, but I can't do that on EC2, so I have to come up with another option.
So I guess that is the long way of saying that hopefully it won't happen again.
I don't believe you can move away from EBS effectively, at least not without giving up quite a bit.
Doing things the right way with EC2 means using EBS. It's the brake caliper to the rotor. Sure, you could have drum brakes, but they're nowhere near as effective, since they quickly get heat soaked. By drum brakes I mean S3.
One shouldn't trust ephemeral storage: your instance can go down at any time. And write speeds to S3 are not nearly as fast as ephemeral storage or EBS arrays (RAID).
Hate to say it, but if one cannot trust EBS, then what the heck are 'we' doing on EC2? EBS quality should be priority one; otherwise we're all building skyscrapers on foam foundations with candy-cane rebar.
I can't say whether much has changed within the last year, but when I worked at FathomDB we had serious issues with EBS. You couldn't trust it. Odd things would happen like disks getting stuck in a reattaching state for days and disks having poor performance.
It still has to be stored somewhere though, right? If it's EBS, you've just made yourself a complicated solution that will eventually fail all over again. No?
We have had a lot of success stabilizing EBS by creating mdadm arrays out of lots of smaller EBS volumes. There are minimal additional costs, and you can get better performance, stability, and protection (RAID 5 or 6).
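For anyone wondering what the "lots of smaller volumes" approach looks like in practice, here is a rough sketch, not our actual tooling. It only builds the `mdadm --create` command line (it does not run it) and estimates usable capacity for RAID 5 or 6. The device names (`/dev/xvdf` and friends), the array name `/dev/md0`, and both helper functions are made up for illustration, and the capacity figure ignores metadata overhead.

```python
# Practical minimum member counts assumed by this sketch (RAID 5 and 6 only).
MIN_DEVICES = {5: 3, 6: 4}

# Number of volumes' worth of space spent on parity at each level.
PARITY_VOLUMES = {5: 1, 6: 2}


def mdadm_create_cmd(md_device, level, devices):
    """Build (but do not execute) an mdadm command creating a RAID array."""
    if level not in MIN_DEVICES:
        raise ValueError(f"this sketch only handles RAID 5 and 6, got {level}")
    if len(devices) < MIN_DEVICES[level]:
        raise ValueError(
            f"RAID {level} needs at least {MIN_DEVICES[level]} devices here"
        )
    return [
        "mdadm",
        "--create", md_device,
        f"--level={level}",
        f"--raid-devices={len(devices)}",
        *devices,
    ]


def usable_gib(level, n_volumes, volume_gib):
    """Approximate usable capacity of n equal-sized volumes, in GiB."""
    if level not in PARITY_VOLUMES:
        raise ValueError(f"this sketch only handles RAID 5 and 6, got {level}")
    return (n_volumes - PARITY_VOLUMES[level]) * volume_gib


if __name__ == "__main__":
    # Hypothetical device names for four attached EBS volumes.
    devices = ["/dev/xvdf", "/dev/xvdg", "/dev/xvdh", "/dev/xvdi"]
    print(" ".join(mdadm_create_cmd("/dev/md0", 6, devices)))
    print(usable_gib(6, len(devices), 50), "GiB usable")
```

As a sense of the cost: eight 50 GiB volumes are 400 GiB raw, which by this estimate leaves about 350 GiB usable in RAID 5 or 300 GiB in RAID 6, so you pay for one or two extra small volumes in exchange for surviving one or two volume failures.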
Gluster makes an OSS distributed filesystem that runs across availability zones. Our AMI (not OSS) builds multiple RAID arrays on each instance, then spreads the filesystem across instances in multiple AZs. Send me an email if you want to chat.