Can you give more specifics about what you were running on and what you purchased for your own gear?
I run an environment that scales to around 1,000 EC2 instances daily. Primarily we run c3.2xlarge and r3.2xlarge for the core of our application.
We have ~12 nodes in our Mongo cluster, and haven't had a single issue with these nodes.
I occasionally get a zombie (a totally hung VM), but that's very infrequent. I was aggressively using spot instances previously, but have switched to all 12-month reservations. We would lose many machines to a spot outage (new machines - more than those on Richess), and the recovery time for our system is 35 minutes because the R3 boxes need to download their in-memory index from other machines, so our service is degraded in capacity until the relaunch of these machines completes.
[aside: if you're looking to use spot, do two things - over-provision by a factor of 1.8 and spread across zones, and go look into using ClusterK.com for their balancer product]
Anyway, just curious what was causing "sometimes daily" outages - I can't imagine that this would be due to AWS rather than to your application's inability to handle instance losses.
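To make the spot advice above concrete, here is a minimal sketch of the arithmetic, assuming you know how many instances you need at steady state. The function names and zone labels are made up for illustration, and the 1.8 factor is simply the one suggested in the aside.

```python
import math

def spot_plan(required, zones, factor=1.8):
    """Spread an over-provisioned spot fleet as evenly as possible across zones."""
    # round() guards against floating-point noise before taking the ceiling
    total = math.ceil(round(required * factor, 6))
    base, extra = divmod(total, len(zones))
    # The first `extra` zones each take one additional instance
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}

def capacity_after_zone_loss(plan):
    """Instances left if the largest zone disappears in a spot outage."""
    return sum(plan.values()) - max(plan.values())

# Hypothetical fleet: 100 instances needed, three zones to spread over
plan = spot_plan(100, ["zone-a", "zone-b", "zone-c"])
print(plan, capacity_after_zone_loss(plan))
```

With 100 required instances and three zones this plans 60 per zone, so losing any one zone still leaves 120 running, which is above the 100 actually needed.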
I've been using EC2 here for nearly 2 years, and you mention I/O problems and instance outages 2-3 times a week. Which size instances were you running?
I ask because, other than the VM security updates, none of our instances have these sorts of issues, and some of them have a VERY long life (not ideal, we know). I understand the cost savings and the rest of the reasoning, but in my experience EC2 isn't THAT unreliable.
Oh, I know what you're talking about. We too had some instances (actually, a lot of those) that would run for a year with no issues. The problems started around the time you tried to push EC2 instances beyond an "idle, handling some requests just to keep from falling asleep" state. Pushing IO (even with provisioned IOPS) caused random IO stalls, pushing CPU caused REALLY uneven performance, etc.
And the only solution provided by EC2 support was always to buy more instances to keep them cold and happy. The problems with that approach (just to name a few): the cost (for a young startup, burning money on idle infrastructure like that is not very wise IMO) and the fact that the time it takes to design, develop and deploy a scale-out approach for each of your backend services is time you could have spent building your product (again, startup-specific; you'll have to think about across-the-board 100% scalability at some point).
First: I work in Startup BD at AWS (disclosure), but have been a multi-time founder as well. I was under the impression that an AWS architect will sit with you to optimize your infrastructure (Business Support). Did that not happen / or was it not useful? Happy to help in any way I can.
What size instances were you using in EC2 that were having performance degradation, and what kind of specs did the real hardware have that you moved to?
How do you handle spikes in request volume? That is, one of the nice things about working in the cloud with dynamic sizing is that your costs should only be relative to your average load, not necessarily your peak load. Given the size of Swiftype, and the fact that you back tons of individual sites (instead of being your own site), there might be enough variance in the sites you back that your peak and average loads are pretty similar. A single ecommerce site, though, might get a huge peak over baseline if they do a big marketing push, for example. In that case, they might have to provision many more physical servers than they usually need to handle that peak. Just wanted to see if this issue came up in your planning.
Honestly, just as it is with many SaaS companies at some scale, we do not need to care about any specific customer's traffic anymore, simply because we already get so much traffic from our existing customers that no new customer could generate enough to cause any significant blip on the radar. If a customer comes to us with some specific requirements (like being able to index 100MM documents with some specific response time guarantees), we build dedicated pieces of infrastructure for them, load-test it all and provide those guarantees. All of the others are placed in their own pools, which have enough capacity to handle 3-5x of ALL of our current traffic with no issues, so any single customer would not be able to generate enough load to cause problems.
And, as I mentioned in the article, we could always order new boxes for any of our clusters and get them online within a couple hours, so we are able to scale up pretty quickly if needed.
Did you reserve your instances? Were you using current generation instances (c3, m3, etc.)? Did you try to take advantage of traffic patterns to scale up and down the number of instances you were running?
We had reserved instances and regular ones, and we did not see any patterns in stability issues between those. Re: instance types - I do not really remember which instances we were using, to be honest. And as for the scaling up and down - we had a hard time keeping it all up as it was, and we did not want to spend resources trying to make it work with constantly changing node pools (though I understand that it would have pushed us toward building a more robust infrastructure able to handle random node outages, we had a business to build and wanted to focus on the product instead of creating a perfectly scalable application at an early stage).
Well, I ask because reserved instances can significantly reduce the price of EC2 (up to something like 75%). Also, just turning off idle instances can save a ton. If you invest the time in doing those things I think you can beat other options, and in that way cloud infrastructure can be very economical.
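As a rough illustration of that claim, here is a minimal sketch of the arithmetic. The hourly rate, fleet size and hours are made-up numbers, not real AWS pricing, and the 75% figure is just the "up to something like 75%" upper bound mentioned above.

```python
def monthly_cost(hourly_rate, instances, hours, discount=0.0):
    """Monthly bill for a fleet, with an optional fractional discount."""
    return hourly_rate * instances * hours * (1.0 - discount)

RATE = 0.50   # hypothetical on-demand price in dollars per instance-hour
FLEET = 20    # hypothetical number of instances
MONTH = 720   # hours in a 30-day month

# Baseline: everything on demand, running around the clock
on_demand = monthly_cost(RATE, FLEET, MONTH)
# Same fleet with the best-case reservation discount applied
reserved = monthly_cost(RATE, FLEET, MONTH, discount=0.75)
# On demand, but shut down for 8 idle hours a day (16 * 30 = 480 hours)
trimmed = monthly_cost(RATE, FLEET, 480)

print(on_demand, reserved, trimmed)
```

Under these assumed numbers the bill drops from 7200 to 1800 with the full reservation discount, or to 4800 just by turning instances off for a third of the day.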
It is like many other things involved in running a technology company. Investing in automation can pay off hugely.
The newer instance types are very reliable too (in my experience).
If you were using old generation instances (t1 m1 c1 etc.) your experience would have been VERY different to current generation instances (c3 c4 m3 t2). Did you try network optimized instances with (very) low latency networking?
Other than EC2, what AWS services were you using and how did you migrate those? At minimum, I'd guess you were using ELB for load balancing, SQS for queues and ElastiCache for Redis.
We are still a loyal customer for some of their services. For example, we still use S3 for off-site backups and Route53 is still our primary DNS provider.
For load balancing we have moved to a Route53 (health checks and round-robin) + a group of nginx+haproxy+lua-based frontend boxes.
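For readers unfamiliar with the pattern, the idea of health checks plus round-robin can be sketched in a few lines. This is only a toy model of the selection logic, not the Route53 API or our actual setup; the host names and the `is_healthy` callback are hypothetical.

```python
def pick_frontends(frontends, is_healthy, count):
    """Rotate `count` requests over the frontends that pass the health check."""
    healthy = [host for host in frontends if is_healthy(host)]
    if not healthy:
        raise RuntimeError("no healthy frontends available")
    # Plain round-robin: request i goes to healthy host i modulo pool size
    return [healthy[i % len(healthy)] for i in range(count)]

# Hypothetical pool where one box is failing its health check
FRONTENDS = ["fe1", "fe2", "fe3"]
DOWN = {"fe2"}

targets = pick_frontends(FRONTENDS, lambda host: host not in DOWN, 4)
print(targets)
```

Here the failing box is simply skipped and the four requests alternate between the two remaining frontends.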
Everything else was either built in-house or used open-source components and wasn't really tied to EC2 infrastructure.
Did you seriously consider a traditional colo or other vendors like SoftLayer (e.g., Rackspace)? It seems like at some point your reasoning here will apply to a colo if you grow bigger.
Colo - that's an option I'll try to stay away from as long as it is humanly possible. All of my experiences with colo hardware caused a lifetime of pain, so I'm happy to be paying SL a premium for their outstanding services (I'm a huge fan of SoftLayer, as you have probably guessed).
Re: Rackspace and other providers: based on my real-life practical experience with a few of the largest providers in the States, SL's quality of service and provisioning speed are miles ahead of what competitors could offer. So it was a no-brainer to go with SL and I'm happy we did.