Fixing holes in EC2 reliability


	Fixing holes in EC2 reliability (mytechgossips.com)
	29 points by buddhika 5326 days ago

3 comments

brettweaverio 5326 days ago

In my experience, you are advocating way too much reliance on EBS. The nature of block storage makes it a poor choice for cloud-attached storage.

Apps should be designed for quick redeployment via configuration management. Persistent data should be stored in S3 or RDS; use EBS only as a last resort.

Moving from a VPS to EC2 is more than a forklift migration; it should be viewed as an application redesign.


bermanoid 5326 days ago

Can you elaborate on what's so wrong with EBS? Are you thinking more about performance, or reliability?

I'm wondering in particular what you'd suggest in cases where RDS is not an option, if you're running your own MongoDB server or something like that. I don't think S3 is really an option there, is it?

Or would you tend to do everything on the instance store and periodically do a manual snapshot to S3?


piramida 5326 days ago

It is not mentioned in the docs, but EBS performance is not the only thing that is unreliable: it is also not uncommon to get a "stuck" EBS volume, where no I/O operations go through at all. So you have to a) monitor for "dead" hosts stuck on I/O and be ready to take action, and b) not rely on EBS as if it were perfect. (It is possible to recover the data in most cases, but it is much easier if you can automatically create a new volume and kill the misbehaving one.)
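One way to do (a) is a small probe that tries a write plus fsync on the volume and gives up after a deadline. This is only a minimal sketch: the function name `probe_io`, the scratch-file prefix, and the 5-second default are assumptions for illustration, not anything AWS provides.

```python
import os
import tempfile
import threading


def probe_io(directory: str, timeout_s: float = 5.0) -> bool:
    """Return True if a small write+fsync in `directory` finishes within timeout_s."""
    done = threading.Event()

    def _write_and_sync() -> None:
        # Create a scratch file on the volume under test and force it to disk.
        fd, path = tempfile.mkstemp(dir=directory, prefix=".ioprobe-")
        try:
            os.write(fd, b"ping")
            os.fsync(fd)
        finally:
            os.close(fd)
            os.unlink(path)
        done.set()

    # Daemon thread: a thread blocked in the kernel on a dead volume cannot be
    # cancelled, so it is left behind instead of blocking interpreter exit.
    threading.Thread(target=_write_and_sync, daemon=True).start()
    # False means the write did not complete in time (or failed outright).
    return done.wait(timeout_s)
```

Run from cron or a monitoring agent against the EBS mount point, a False result is the cue to alert or to start replacing the volume. Each probe that hangs leaves one blocked thread behind, so a long-running agent should stop probing once a volume has been flagged.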

So if you are running a custom data-intensive setup, you could:

1. Create a RAID out of several EBS volumes and maintain that: predictable speed, much better reliability.
2. Easier, if you need a small-scale solution: take regular EBS snapshots and be ready to lose some data written between snapshots.

Also, I think disk performance is not a big issue in the MongoDB case, so option 2 might be enough for your application. EBS failures are rare, after all, so unless your data is critical and there is no way to do replication, snapshots would do.


brettweaverio 5326 days ago

EBS performance is notoriously variable, by orders of magnitude. Amazon is working on this; however, my personal belief is that using block storage creates an undesirable link between your storage and compute layers. Using S3, RDS or another NoSQL solution is a far more modular approach.

That being said, all apps are different and some may need block storage. I'm not saying it should not be used; however, when making that decision, the coupling and hard dependencies it creates should be understood.


obfuscate 5326 days ago

Er, RDS is (a) SQL (b) backed by EBS.


brettweaverio 5326 days ago

Understood; however, AWS manages the association to block-level storage, as well as the performance tuning. You can still operate with ephemeral compute decoupled from storage.


sauravc 5326 days ago

EBS volumes are analogous to hard drives. They're reliable... until they fail. So keep backups. This is easily done with EBS snapshots (which are persisted to S3).
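For what it's worth, a scheduled snapshot can be a few lines. The sketch below assumes `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`) passed in by the caller; the function name `snapshot_volume` and the description format are made up for illustration.

```python
from datetime import datetime, timezone
from typing import Any


def snapshot_volume(ec2: Any, volume_id: str) -> str:
    """Start a snapshot of one EBS volume and return the new snapshot's ID."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    response = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"scheduled backup of {volume_id} at {stamp}",
    )
    # create_snapshot returns once the snapshot has been started; the copy
    # itself completes in the background.
    return response["SnapshotId"]
```

Anything written after the snapshot starts is not in it, so the schedule sets how much data you are prepared to lose.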


justincormack 5326 days ago

tl;dr no one told him Amazon EC2 is not a VPS provider. Lots of people seem to assume it is. Instances are supposed to die; that's a feature, not a bug.


mchanson 5326 days ago

This is the most important lesson here. I've heard too many stories that end badly and I've learned this the hard way myself.


rhizome 5326 days ago

Ooops sorry, downvoted by accident.

I have a question though: how do people get bitten by this lack of instance persistence?


bmelton 5326 days ago

In many cases, people don't do their homework and set up regular, VPS-like web servers on EC2. What happens then is that they have a real, established website that, weeks, months, or years down the road, eventually gets rebooted and disappears.

The EC2 instances boot to 'boot images', basically. Most of the images are like CDs, and contain just enough to get you ready to install your webserver, database, yadda yadda.

You can configure your image how you like, and then create a new 'image', which will be what your machine looks like after a reboot, but unless you use a persistent data store or external storage of some sort, you can't add new blog posts and expect them to be there after a reboot.

There are easy ways around it, and in fact they are best practices for application design, but compared to the normal shared hosting or VPS configurations that most people know, it is completely different.
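To make the blog-post case concrete, here is a rough sketch of keeping posts in S3 instead of on the instance's own disk. It assumes `s3` is a boto3 S3 client handed in by the caller; the `posts/<slug>.html` key layout and the function names are hypothetical.

```python
from typing import Any


def save_post(s3: Any, bucket: str, slug: str, body: str) -> str:
    """Write one blog post to S3 and return the object key it was stored under."""
    key = f"posts/{slug}.html"
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body.encode("utf-8"),
        ContentType="text/html; charset=utf-8",
    )
    return key


def load_post(s3: Any, bucket: str, slug: str) -> str:
    """Read a post back; works from any instance, including a fresh replacement."""
    response = s3.get_object(Bucket=bucket, Key=f"posts/{slug}.html")
    return response["Body"].read().decode("utf-8")
```

With the content stored this way, the instance can be thrown away and rebuilt from its image without losing posts.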


rhizome 5326 days ago

Oh hah, I wasn't even thinking like that.

Yep, it's a good idea to save your work.


blurbytree 5326 days ago

The other lesson I'd add is that if an instance dies and can't be rebooted, you'll need to stop/start it, which will result in an IP address change.

Best to have elastic IPs at the ready for that instance, or be prepared to deal with the IP changes.
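As a rough sketch of the "at the ready" part: pointing an already-allocated Elastic IP at a recovered or replacement instance is one call. This assumes `ec2` is a boto3 EC2 client supplied by the caller and that the address lives in a VPC (so it has an allocation ID); the function name is made up.

```python
from typing import Any


def reattach_elastic_ip(ec2: Any, allocation_id: str, instance_id: str) -> str:
    """Point an existing Elastic IP at `instance_id` and return the association ID."""
    response = ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=instance_id,
        # Take the address over even if it is still associated elsewhere.
        AllowReassociation=True,
    )
    return response["AssociationId"]
```

DNS keeps pointing at the same public address, so clients do not have to notice that the instance behind it changed.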


jl6 5326 days ago

What's wrong with booting instances from EBS-backed images?


gregholmberg 5326 days ago

EBS-backed instances run nice and fast, until they don't.

EBS root volume instances will run just fine, until the OS needs some data from the root volume and can't get it.

Many identical copies of your EBS blocks are stored across clusters, with quorum voting. Sometimes the clusters are all fast. Then your instance will run fast. If one cluster degrades, the good clusters will vote it down, the fast clusters will answer quickly, and your instance will still run fast.
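A toy model of the voting behaviour described above, not a description of how EBS is actually implemented: if a request completes once a quorum of replicas has answered, its latency is that of the quorum-th fastest replica. The function name and the sample latencies are invented for illustration.

```python
from typing import List


def quorum_latency(replica_latencies_ms: List[float], quorum: int) -> float:
    """Latency of a request that completes once `quorum` replicas have answered."""
    if not 1 <= quorum <= len(replica_latencies_ms):
        raise ValueError("quorum must be between 1 and the replica count")
    # The request finishes when the quorum-th fastest replica responds.
    return sorted(replica_latencies_ms)[quorum - 1]


# One degraded replica out of three is outvoted by the two fast ones.
print(quorum_latency([2.0, 3.0, 900.0], quorum=2))    # 3.0
# With two degraded replicas, the request has to wait for a slow one.
print(quorum_latency([2.0, 800.0, 900.0], quorum=2))  # 800.0
```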

If several clusters are running slow, and there are not enough good clusters to override the slow clusters, then your instance must wait for the slow clusters to clear some backlogged I/O. You can see these kinds of traffic jams in the CloudWatch monitoring tool for the EBS volume: watch the read/write latency.
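If you pull the numbers out of CloudWatch yourself, average latency per operation can be derived from the per-period sums of the EBS metrics `VolumeTotalReadTime` (seconds) and `VolumeReadOps` (count), and likewise for writes. A minimal sketch of that arithmetic follows; the datapoint tuple layout, the function names, and the threshold are assumptions, and fetching the metrics is left out.

```python
from typing import Iterable, List, Optional, Tuple


def average_latency_ms(total_time_s: float, ops: float) -> Optional[float]:
    """Mean time per I/O operation in milliseconds, or None for an idle period."""
    if ops <= 0:
        return None
    return total_time_s * 1000.0 / ops


def slow_periods(
    datapoints: Iterable[Tuple[str, float, float]], threshold_ms: float
) -> List[str]:
    """Return the labels of periods whose mean latency exceeds threshold_ms.

    Each datapoint is (label, total_time_s, ops) for one period.
    """
    slow = []
    for label, total_time_s, ops in datapoints:
        latency = average_latency_ms(total_time_s, ops)
        if latency is not None and latency > threshold_ms:
            slow.append(label)
    return slow
```

For example, 45 seconds of total read time over 300 operations works out to 150 ms per read, which would stand out next to a period at 5 ms.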

If new I/O requests arrive at the block storage clusters before old requests clear, your root volume device driver will appear to be "stuck". You will not be able to complete any more I/O on the device.

If your OS wanted a memory page from your swap device, and your swap device was behind a latency-choked curtain of multiply redirected EBS blocks, your instance may now be unrecoverable without a reboot.

Although the EBS volume is still attached to your instance, and all the clusters are still online, your I/O request never returns because the complex system designed to fulfill the request has collapsed into a state of congestion that it cannot easily recover from. To clear the problem in April 2011, AWS sysadmins drove to other data centers to unrack clean cluster systems. By adding EBS capacity at the chokepoint, they broke the logjam.

Generally speaking, if your kernel enters uninterruptible code, and the resource it wanted cannot be reached, your OS is going to hang, hard.

It is a good idea to keep your operating system -- its kernel, its libraries, its application code -- as close to the running system as possible. For Amazon EC2, this (arguably) means using instance-store (aka ephemeral disk) storage.


Maxious 5326 days ago

Performance. http://orion.heroku.com/past/2009/7/29/io_performance_on_ebs...
