by parasubvert 5566 days ago
Generally speaking, this is the sort of thing people warn about when they say "if you want to run on a cloud, you need to design your application for a cloud". Meaning, you can't presume your infrastructure is dedicated and carries MTBFs similar to those of (say) an enterprise hard drive, which is upwards of 1 million hours.
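To put that MTBF figure in perspective, here is a rough sketch that converts an MTBF into a chance of failure within one year. It assumes a constant failure rate (an exponential model), which is my simplification and not something a drive vendor or Amazon states; the function name is made up for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annual_failure_probability(mtbf_hours: float) -> float:
    """Chance of at least one failure in a year of continuous use.

    Assumes a constant failure rate of 1 / MTBF (exponential model),
    which is a simplification chosen for this sketch.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


# The 1 million hour figure quoted above works out to under 1% per year.
print(f"{annual_failure_probability(1_000_000):.2%}")
```

Under that assumption a single dedicated drive fails rarely enough that people design around it implicitly, which is exactly the presumption that doesn't carry over to shared infrastructure.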

Amazon provides plenty of ways to mitigate this, such as multiple availability zones. Reddit, if you read the original blog post, wasn't designed for that - it was designed for a single data centre.

OTOH, the variability of EBS performance is real, and frustrating. If you do a RAID0 stripe across 4 drives, you can expect around 100 MB/sec sustained, modulo hiccups that can bring it down by a factor of 5. On a Cluster Compute instance (cc1.4xlarge) it's more like up to 300 MB/sec if you go up to 8 drives, since they provision more network bandwidth and seem to be able to cordon it off better with a placement group.

> modulo hiccups that can bring it down by a factor of 5.

The comments on reddit indicated hiccups closer to a factor of 10 and, sometimes, 100.

Either way, the issue is that the more drives you add to your RAID0, the more often one of those drives experiences a "hiccup" and kills the performance of the entire volume, since a stripe is only as fast as its slowest member.
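A back-of-the-envelope sketch of that scaling, assuming each drive hiccups independently with the same probability at any given moment. Both the independence assumption and the 1% figure are made up for illustration, not measured EBS numbers, and independence would not hold if the cause is shared (a network switch, say).

```python
def p_stripe_hiccup(p_single: float, n_drives: int) -> float:
    """Chance that at least one of n_drives is hiccuping right now.

    Assumes hiccups are independent and equally likely on every drive,
    which is a simplification for this sketch.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_drives < 1:
        raise ValueError("n_drives must be at least 1")
    # The stripe is healthy only if every member is healthy.
    return 1.0 - (1.0 - p_single) ** n_drives


# Hypothetical: each drive is in a hiccup 1% of the time.
for n in (1, 4, 8):
    print(n, f"{p_stripe_hiccup(0.01, n):.2%}")
```

For small per-drive probabilities the result grows almost linearly with the number of drives, so doubling the stripe width roughly doubles how often the whole volume is slow.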

It's not clear this was a single-volume problem rather than an issue with one or more network switches in that availability zone (if you look at the AWS service health notes for that date).

Even in your own data centre, if your FC fabric goes wonky, your whole SAN is hosed.