by parasubvert 5614 days ago

Generally speaking, this is the sort of thing that people warn about when they say "if you want to run on a cloud, you need to design your application for a cloud". Meaning, you can't presume your infrastructure is dedicated and carries an MTBF similar to that of (say) an enterprise hard drive, which is upwards of 1 million hours.
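
For a sense of scale, here is a rough sketch of what a 1 million hour MTBF means per year. It assumes a constant failure rate (the usual exponential model), which is a simplification, and the function name is just made up for illustration.

```python
import math

HOURS_PER_YEAR = 8760  # 24 * 365

def annualized_failure_rate(mtbf_hours: float) -> float:
    """Chance that a single device fails within one year.

    Assumes a constant failure rate (exponential model), which is
    a simplification and not something a drive vendor guarantees.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

# Enterprise drive at the 1 million hour MTBF quoted above.
print(f"{annualized_failure_rate(1_000_000):.2%}")
```

That works out to a bit under 1% per drive per year, which is the kind of reliability you can't assume from shared cloud infrastructure.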

Amazon provides plenty of ways to mitigate this, such as multiple availability zones. Reddit, if you read the original blog post, wasn't designed for that; it was designed for a single data centre.

OTOH, the variability of EBS performance is real, and frustrating. If you do a RAID0 stripe across 4 drives, you can expect around 100 MB/sec sustained, modulo hiccups that can bring it down by a factor of 5. On a cluster compute instance (cc1.4xlarge) it's more like up to 300 MB/sec if you go up to 8 drives, since they provision more network bandwidth and seem to be able to cordon it off better with a placement group.

khafra 5614 days ago

> modulo hiccups that can bring it down by a factor of 5.

The comments on reddit indicated hiccups more on the order of 10x and, sometimes, 100x.

Either way, the issue is that the more drives you add to your RAID0, the more often one of those drives experiences a "hiccup" and kills the performance of the entire volume.
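
To put a rough number on that, here is a back-of-the-envelope sketch. It assumes each drive hiccups independently with the same probability at any given moment; the 1% figure is invented for illustration and is not a measured EBS number, and the function name is hypothetical.

```python
def p_any_hiccup(p_single: float, n_drives: int) -> float:
    """Probability that at least one of n drives is hiccuping right now.

    Assumes hiccups are independent and equally likely per drive.
    A RAID0 stripe has to wait on its slowest member, so a single
    slow drive is enough to drag down the whole volume.
    """
    return 1.0 - (1.0 - p_single) ** n_drives

# Illustrative only: 1% per-drive hiccup probability is an assumption.
for n in (1, 4, 8):
    print(n, round(p_any_hiccup(0.01, n), 4))
```

Under those assumptions the chance of the stripe being degraded grows almost linearly with the number of drives while the per-drive probability is small. If the hiccups are correlated (say, a shared network path), the independence assumption breaks down and this estimate no longer applies.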

parasubvert 5613 days ago

It's not clear this was a single-volume problem; it looks more like an issue with one or more network switches in that availability zone (going by the AWS service health notes for that date).

Even in your own data centre, if your FC fabric goes wonky, your whole SAN is hosed.