| HN Mirror



	by jedberg 5573 days ago
	We're currently in the process of replacing every one of our hosts with new OS versions. As we do this, we are in fact moving to the EBS-based instances. Those instances actually show the same problems, but they aren't too bad, because once you boot them, you don't need the root volume that much (that's what the instance storage is for).

4 comments

jjm 5573 days ago

Some Qs:

Q1. I still don't get the use case for db storage on ephemeral storage.

Q2. If EBS is the problem why are you migrating to S3 backed EBS boot vols? The problem with this is still the time in between snapshots even though it will be shortened.

Some Comments: It will only be a matter of time before S3 disks and hardware start dying like EBS...en masse

I talked with Ketralnis several years ago and know how many VMs you were running back then. Pretty sure you're not too far off from that count even today (even if 2x).

You can still virtualize on a good set of dedicated hardware to emulate your current 'network environment' to get you up and running in the near term _asap_. Obviously you'd build out of that vm environment (with your load) as the days go by. Seriously look into a parallel switch over though.

If EBS is in fact a huge issue as has been shown, you really may need to start migrating off unless you want dedicated employees monitoring system health on AWS. Eventually if problems continue that is what will happen, with no time left to even develop automation... And why automate on a pile of instability?

Don't forget that adding more VMs with this high failure rate increases soft management costs and will eventually eat into your development time...

I don't work for Rackspace (I think they're quite expensive), but you guys might benefit from this level of care to focus on the real issues.


jedberg 5571 days ago

> Q1. I still don't get the use case for db storage on ephemeral storage.

We're still not sure either, so we're investigating to see if it makes sense. One possible option will be to have the master on ephemeral disk with a hot backup on EBS so there is no data loss.

Another option is to use ephemeral for the master and all but one slave, so we get hot backups without a slowdown.

Still need to look into it more.

The one that we are running on ephemeral right now is Cassandra, with continuous snapshots to EBS. Everything in there can be recalculated, and with an RF of 3, if we lose one node we can run a repair.
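To make the RF arithmetic concrete, here is a minimal sketch. It assumes quorum-style reads and writes, which the comment above does not actually state, and the function names are purely illustrative (they are not part of any Cassandra API).

```python
def quorum(replication_factor):
    """Replicas that must respond for a quorum: a strict majority of RF."""
    return replication_factor // 2 + 1


def tolerable_replica_losses(replication_factor):
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    rf = 3  # the replication factor mentioned in the comment
    print("quorum size:", quorum(rf))
    print("replicas that can be lost:", tolerable_replica_losses(rf))
```

Under that assumption, an RF of 3 leaves two live copies of each row after a single node is lost, which is why a repair onto a replacement node is enough to restore full redundancy.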

> Q2. If EBS is the problem why are you migrating to S3 backed EBS boot vols? The problem with this is still the time in between snapshots even though it will be shortened.

They are just easier to use. The root volume is rarely accessed after it is booted, so the EBS slowdowns aren't really a problem in that case.

> Some Comments: It will only be a matter of time before S3 disks and hardware start dying like EBS...en masse

I don't think so. It is a totally different product built by a totally different team with a different philosophy. S3 was built for durability above all else.

In response to the rest of your comments, you are absolutely right, there are other options. We will certainly be investigating them.


jjm 5572 days ago

I meant to say several months ago, not years.


ktsmith 5573 days ago

Thanks for the follow-up, jedberg. I was just guessing based on what has been publicly stated in the various blog posts over the last year or two. I used the same process for my own s3 -> ebs boot volume migration. That took a few weeks, and I didn't have that many instances to migrate in the first place. Given the large number of instances reddit uses and the surprisingly small staff, one would reasonably expect that the migration would not be done.


gregburek 5573 days ago

Thanks for the extra info! I'm doing a lot of work now with python on EC2 and the reddit write ups + presentations have been a huge help. Thanks again.


jedberg 5573 days ago

> python on EC2

#1 tip: Don't use threading. Python threading + EC2 will not work well. Instead rely on the OS doing the task switching and run multiple copies.

If you want more info, I did a talk at Pycon about this and other things: redd.it/b5jyy
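A minimal sketch of the "run multiple copies" idea, using the standard library's `multiprocessing` module. This is only one way to get several OS-scheduled Python processes and is not a description of reddit's actual setup; `cpu_bound_task` and `run_in_processes` are hypothetical names made up for the example.

```python
import os
from multiprocessing import Pool


def cpu_bound_task(n):
    """Hypothetical CPU-heavy job: sum of squares of the integers below n."""
    return sum(i * i for i in range(n))


def run_in_processes(inputs, workers=None):
    """Spread the inputs over separate OS processes instead of threads."""
    workers = workers or os.cpu_count() or 1
    with Pool(processes=workers) as pool:
        # Each worker is its own Python interpreter, scheduled by the OS.
        return pool.map(cpu_bound_task, inputs)


if __name__ == "__main__":
    print(run_in_processes([10, 100, 1000]))
```

For a web app, the same effect usually comes from starting several copies of the server process behind a load balancer rather than from a pool inside one program.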


gregburek 5573 days ago

Duly noted. I started with this talk and have been using it as a guide to scaling edge cases with python as well as AWS. I thought raid10 was overkill before I started digging into the postgres/EBS mess, but now it seems almost routine enough that Amazon should have it as a configuration option.


WALoeIII 5573 days ago

Did you get stuck in the Fedora 8 trap as well? It was the 'starter' in 2008/2009 and it took us two years to get off it.


jedberg 5571 days ago

Ubuntu 8.10 for us.
