by rdoherty 4128 days ago
I agree. I read this post and was shocked at the amount of planning, process, man-hours, hardware issues and other problems that come with hardware. I've worked at places with ~500 EC2 machines in a dozen autoscaling groups across 3 AZs with many ELBs, databases, SQS queues and other AWS infrastructure and never had to deal with anything like this when upgrading.

Upgrading hardware in EC2 is as simple as changing a launch configuration and updating an auto-scaling group. Maybe an hour of my time to update configs, verify and deploy. Updating something like a database or caching servers is more work for sure, but with 0 time needed to get to the DC, unpack, rack and configure servers you do save time with 'the cloud'.

I get that you do pay more for EC2 instances, especially if you keep hardware for 4 years. But AWS prices drop every year or two along with (generally) faster versions of software so your overall costs do drop.

How many ops employees would you need for a fleet of 500 servers in a datacenter? We managed it all with 4 people with AWS.

The Stack Exchange philosophy is that because they can buy truly mega hardware (each one of those two blade chassis they bought has 72 cores and 1.4TB of RAM, remember!), they don't need those 500 servers to start with. Plus the hardware is an asset and you get to depreciate it.

Everywhere I've ever worked we've had the "big spreadsheet" of projected cloud costs, projected ops costs, and hardware costs. In general the "scale horizontally" philosophy will favor the cloud while the "scale vertically" philosophy still seems to favor owned hardware in local datacenters. Which is superior is a crazy, long-standing debate with no clear answer.

The biggest cost of using Amazon isn't the hardware, it's the markup on traffic (if you are a dynamic site).

Can you elaborate? I thought the answer to that question was to scale up if you can, because it's much simpler and therefore cheaper. Similar to how you don't give up ACID unless the scale you're working at doesn't permit it anymore.

There's never really a "one size fits all" answer, which is why it's a long-running debate and depends heavily on the product.

Scaling horizontally can let you use smaller, cheaper hardware on average and burst to higher capacity more easily if you need to, at the expense of a lot of complexity. It also (done right, which is rare) tends to gain you a greater degree of fault tolerance, since hardware instances become rapidly-replaceable commodities.

Most web apps have spiky but relatively predictable load. For example, a typical enterprise SaaS startup gets more traffic during work hours than on weekends. For these companies the complexity of developing a horizontally scaled architecture can be offset by the lower cost of paying for really big capacity only at peak load and then scaling back to a couple of small instances for periods of below-average load.

That's (ostensibly) why AWS exists in the first place: Amazon had to buy a lot of peak capacity for Black Friday and Christmas and found it going unused the rest of the year. They never meant to sell their excess capacity, but they realized the tools that they built to dynamically scale their infrastructure were valuable to others.

Plus, a lot of work is offline data analytics, ETL, and so on. It's very cost effective to scale these workloads horizontally on-demand - spin up extra workers to run your reporting each hour/night and keep costs down the rest of the time when you don't need the capacity.

On the flip side, companies like Stack Exchange and Basecamp have high, relatively stable traffic worldwide. For companies like this it makes more sense to scale vertically - if they were in the cloud, they would never scale down or shut down their instances anyway.
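The spiky-versus-stable argument above can be sketched as a toy cost model. This is only an illustration: the load profiles, the unit prices, and the helper names `fixed_cost` and `elastic_cost` are all made up for the example and do not come from any real pricing.

```python
import math

def fixed_cost(load, unit_cost):
    """Provision for the peak hour and pay for that capacity every hour."""
    return max(load) * unit_cost * len(load)

def elastic_cost(load, unit_cost, floor=1):
    """Pay only for the capacity each hour needs, never below `floor` units."""
    return sum(max(math.ceil(x), floor) for x in load) * unit_cost

# Hypothetical capacity units needed per hour over one day.
spiky = [1] * 8 + [10] * 8 + [1] * 8   # busy during work hours only
flat = [9] * 24                        # high, steady worldwide traffic

OWNED = 1.0   # hypothetical cost per unit-hour on fixed hardware
CLOUD = 2.0   # hypothetical cost per unit-hour on demand (assumed 2x premium)

for name, load in (("spiky", spiky), ("flat", flat)):
    print(name, fixed_cost(load, OWNED), elastic_cost(load, CLOUD))
```

With these invented numbers, elastic capacity wins for the spiky profile (192 vs. 240) and fixed capacity wins for the flat one (216 vs. 432). Change the assumed premium or the shape of the load and the answer flips, which is more or less why the debate never ends.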

Personally, I agree that horizontal scalability is oversold and most people can, indeed, scale up instead of out. However, plenty of smart people disagree with me and have valid reasons to scale horizontally, too.

> a typical enterprise SaaS startup gets more traffic during work hours than on weekends.

You still need to budget for what you can get when renting dedicated hardware vs. renting virtual machines. For example, a dual Xeon X5670 machine w/ 96GB RAM and 4x480GB SSD can be had for $249 per month (just something random I found for demo purposes). Even if you do a reserved instance for a year on EC2, you can get an m3.2xlarge for this kind of money, and that's only 30GB RAM and 2x80GB SSD.
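To put rough numbers on that comparison, here is a small sketch using only the figures quoted above. It assumes both options cost exactly the same $249 per month (the comment only says "this kind of money"), and the helper name `usd_per_gb` is made up for the example.

```python
# Figures quoted in the comment above; assumes both cost $249/month.
MONTHLY_USD = 249.0

dedicated = {"ram_gb": 96, "ssd_gb": 4 * 480}   # dual Xeon X5670 box
m3_2xlarge = {"ram_gb": 30, "ssd_gb": 2 * 80}   # reserved EC2 instance

def usd_per_gb(spec, key):
    """Monthly dollars paid per GB of the given resource."""
    return MONTHLY_USD / spec[key]

for key in ("ram_gb", "ssd_gb"):
    d = usd_per_gb(dedicated, key)
    e = usd_per_gb(m3_2xlarge, key)
    print(f"{key}: dedicated ${d:.2f}/GB, EC2 ${e:.2f}/GB, ratio {e / d:.1f}x")
```

Under that equal-price assumption, the EC2 instance works out to about 3.2x the price per GB of RAM and 12x the price per GB of SSD. It ignores CPU, bandwidth, and the ability to shut the instance off, so treat it as a starting point rather than a verdict.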

It might be worth it to rent this sort of iron instead of spinning EC2 instances up and down, especially if you can reasonably buy a large enough machine to cut a lot of the headaches arising from distributed computing. The right tool for the job.

Owning hardware is again a different bag of hurt.

> I thought the answer to that question was to scale up if you can, because it's much simpler and therefore cheaper

✱cough✱ SQL Server licensing fees ✱cough✱

That's a good point that I overlooked in my post - I'm sure this is a huge consideration for Stack Exchange.

For what it's worth, Basecamp also evangelizes the "scale up" approach and they're on an open-source stack.

> How many ops employees would you need for a fleet of 500 servers in a datacenter? We managed it all with 4 people with AWS.

This could be a false dichotomy. Just because a service with AWS uses so many servers doesn't mean a more monolithic system would need as many.

We did talks with one of our competitors (before they were a competitor). We mentioned that we ran our infrastructure on 4 large VM hosts (with a light density of 3-4 VMs per host). They were shocked. They were currently running over a hundred EC2 instances, with the relevant satellite services. They literally could not believe that we could provide a comparable service without reliance on something like AWS.

It's amazing what can be done with the right knowledge. In our case, one of my coworkers and I maintain our infrastructure at something like 4-6 hours per week total (mostly patching, reviewing logs, etc). We both have previous networking, hardware, and software experience. When we do major upgrades (about every 2 years), it takes one of us about a week to source the hardware, get it loaded into the rack, and turned on. Then we migrate guests over and we're done. This doesn't even get into the cost savings of running on our own hardware vs AWS pricing.

We run about a hundred servers (soon to be lots more) with a part-time staff of 3 (as in, we all do dev work most of the time). It used to be mostly me for ages, but we got big enough that I got promoted out of most of the day-to-day stuff.

It's all our own hardware, and having just had a reboot on SoftLayer's schedule to fix the Xen issue for a separate project we're running on their gear, I can say that being able to schedule your own maintenance windows is so much nicer. We spend less time dealing with problems on our own hardware than we do dealing with cloud providers having issues.

At my previous company, I built a dedicated hardware system that consistently delivered a sub-second response time at a cost of about a third of what it would cost to host on AWS.

After I left, the new CTO who replaced me migrated it to Rackspace's cloud offering. I don't know the costs involved, but now the site averages an 8s load time.

You can't really beat the raw performance for the price of dedicated equipment.

Where I am currently working, thanks to good automation procedures, there are only 3 people managing 4 datacenters on 3 continents with over 1000 virtual machines and a couple of hundred physical servers. Those 3 people are one Linux sysadmin, one network guy and one VMware guy. None of them works full-time on maintaining the infrastructure; they only do patching/upgrading/installing new systems, and that's 1/2 days a week at most. I have now finished the plan for the migration of two datacenters, and that process takes about 2 months with shipping/networking/configuring/installing machines.

I really don't understand the obsession of getting rid of ops / hardware guys and relying on Amazon/Google/CoolCloudProvider to handle everything.

I worked for a large SCADA company. We collected large amounts of data from thousands of large industrial installations.

One day we got a new VP, a "cloud expert" who came from a well-known firm. He moved (nearly) all of our infrastructure to AWS, after producing untold numbers of spreadsheets and PowerPoints explaining how much cheaper/better/faster it was going to be.

Long story short, it was 4x as expensive as running it in-house. By the time they went back to our own infrastructure, most of the internal sysops (including "The Glue" guy) had moved on and much of the old internal hardware was re-purposed or gone. It was a fiasco that they still have not fully recovered from.

I would be very careful in characterizing AWS as the solution for every large scale computer infrastructure problem.

Conversely, I have had excellent experiences with AWS in my current job, although we still have a rather large HPC cluster internally which would never make sense to move to AWS.

> How many ops employees would you need for a fleet of 500 servers in a datacenter? We managed it all with 4 people with AWS.

I'd say our goal is to keep growing and serving more content without _needing_ 500 servers in a data center. We are doing pretty well at that so far. We'll see what happens in the future.