by gregholmberg 5279 days ago
EBS-backed instances run nice and fast, until they don't.

EBS root volume instances will run just fine, until the OS needs some data from the root volume and can't get it.

Many identical copies of your EBS blocks are stored across clusters, with quorum voting. Sometimes the clusters are all fast. Then your instance will run fast. If one cluster degrades, the good clusters will vote it down, the fast clusters will answer quickly, and your instance will still run fast.
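To make the voting intuition concrete, here is a toy model, not a description of how EBS actually replicates data. It assumes a request completes as soon as a quorum of replicas has answered, so the request latency is that of the k-th fastest replica. The function name, replica counts, and latencies are invented for illustration.

```python
def quorum_latency(replica_latencies_ms, quorum):
    """Latency of a request that completes once `quorum` replicas have answered.

    Toy assumption: replicas answer independently, and the request is done
    when the quorum-th fastest reply arrives.
    """
    if not 1 <= quorum <= len(replica_latencies_ms):
        raise ValueError("quorum must be between 1 and the number of replicas")
    return sorted(replica_latencies_ms)[quorum - 1]


# Hypothetical numbers: three replicas, quorum of two.
all_fast = [2.0, 3.0, 2.5]
one_slow = [2.0, 3.0, 900.0]
two_slow = [2.0, 850.0, 900.0]

for name, latencies in [("all fast", all_fast),
                        ("one slow", one_slow),
                        ("two slow", two_slow)]:
    print(name, quorum_latency(latencies, quorum=2), "ms")
```

With one slow replica the quorum is still made up of fast replies. With two slow replicas, the second reply has to come from a slow one, and the whole request waits for it.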

If several clusters are running slow, and there are not enough good clusters to override the slow clusters, then your instance must wait for the slow clusters to clear some backlogged I/O. You can see this kind of traffic jam in the CloudWatch monitoring tool for the EBS volume: watch the read/write latency.
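As a rough sketch of what "watch the latency" can mean in numbers: CloudWatch reports VolumeTotalReadTime (seconds) and VolumeReadOps (count) per period for an EBS volume, and dividing one by the other gives an average read latency. The write-side metrics work the same way. The code below only does the arithmetic on datapoints you have already fetched. The sample values, the 50 ms threshold, and the function names are made up for illustration.

```python
def average_latency_ms(total_time_s, op_count):
    """Average latency per operation in milliseconds, or None if the volume was idle."""
    if op_count == 0:
        return None
    return 1000.0 * total_time_s / op_count


def flag_slow_periods(datapoints, threshold_ms=50.0):
    """Return (label, latency_ms) for periods whose average latency exceeds the threshold.

    `datapoints` is a list of (label, total_time_seconds, op_count) tuples,
    e.g. the Sum statistic of VolumeTotalReadTime and VolumeReadOps per period.
    """
    slow = []
    for label, total_time_s, ops in datapoints:
        latency = average_latency_ms(total_time_s, ops)
        if latency is not None and latency > threshold_ms:
            slow.append((label, latency))
    return slow


# Hypothetical five-minute periods: (label, total read time in s, read ops).
sample = [
    ("12:00", 1.5, 600),   # healthy period
    ("12:05", 90.0, 300),  # traffic jam
    ("12:10", 0.0, 0),     # idle, no latency to report
]

print(flag_slow_periods(sample))
```

An idle period is reported as None rather than zero, so a stalled volume that completes no I/O at all is not mistaken for a fast one. That case needs a separate check on the operation count.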

If new I/O requests arrive at the block storage clusters before old requests clear, your root volume device driver will appear to be "stuck". You will not be able to complete any more I/O on the device.
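The "stuck" behavior follows from simple queue arithmetic: if requests arrive faster than they are served, the backlog grows without bound. A minimal sketch with invented rates and a fixed-tick model, not a model of the real EBS request path:

```python
def simulate_backlog(arrivals_per_tick, served_per_tick, ticks, start=0):
    """Track the number of queued requests over time.

    Each tick, `arrivals_per_tick` requests are added and up to
    `served_per_tick` are completed. The backlog never goes below zero.
    """
    backlog = start
    history = []
    for _ in range(ticks):
        backlog = max(0, backlog + arrivals_per_tick - served_per_tick)
        history.append(backlog)
    return history


# Healthy storage: it serves requests faster than they arrive.
print(simulate_backlog(arrivals_per_tick=100, served_per_tick=120, ticks=5))

# Degraded storage: the same arrival rate, but service has slowed down.
print(simulate_backlog(arrivals_per_tick=100, served_per_tick=60, ticks=5))
```

In the degraded case the queue gains 40 requests every tick and never drains on its own. It only recovers if arrivals drop or service capacity goes up.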

If your OS wants a memory page from your swap device, and that swap device sits behind a latency-choked curtain of multiply redirected EBS blocks, your instance may be unrecoverable without a reboot.

Although the EBS volume is still attached to your instance, and all the clusters are still online, your I/O request never returns because the complex system designed to fulfill the request has collapsed into a state of congestion that it cannot easily recover from. To clear the problem in April 2011, AWS sysadmins drove to other data centers to unrack clean cluster systems. By adding EBS capacity at the chokepoint, they broke the logjam.

Generally speaking, if your kernel enters uninterruptible code, and the resource it wanted cannot be reached, your OS is going to hang, hard.

It is a good idea to keep your operating system -- its kernel, its libraries, its application code -- as close to the running system as possible. For Amazon EC2, this (arguably) means using instance-store (aka ephemeral disk) storage for the root volume.