Having a basic scaffolding in place on a hosted cloud and making sure your devops scripts are up to snuff is a good idea when you don't know how much infrastructure you need, because then when the situation calls for it you can fire up a new node on-demand.
But unless you're still "in the garage" and a couple of DigitalOcean droplets are good enough, it's going to be much, much cheaper and usually much wiser to run your core infrastructure on your own colocated bare metal.
I've seen companies increase their server expenses by ~$1M/yr by moving everything to EC2, and they sit around congratulating themselves for it because now "they're in the cloud". There's no reason to do that!
Little humorous tangent: an AWS rep told someone I've worked with that Amazon really wanted to help them secure better pricing, because as new CFOs come from self-hosted companies and into AWS-dependent companies, the CFO's eyes bug out when they see the Amazon bills and EC2 becomes the first thing on the chopping block.
Script your stuff out in Ansible or something similar, run it on your own hardware, and use GCloud/EC2 as secondary data centers for failover/backup/support/emergency bursts/whatever. You can have the flexibility without paying through the nose.
> Script your stuff out in Ansible or something similar, run it on your own hardware, and use GCloud/EC2 as secondary data centers for failover/backup/support/emergency bursts/whatever. You can have the flexibility without paying through the nose.
Except then you have to run your own networking and when shit fails (as disks, links, and switches are wont to do), it's now "your problem". Hybrid clouds and not being a tenant are nice, but not without time and monetary costs -- by the time you have geographically distinct failover, you've also spent a non-trivial amount in opportunity costs making phone calls, flying around, and writing lines of code and config for things customers don't even know exist.
And when EC2 falls over, like it tends to do a few times a year? Hosts fall over, stuff dies. At something the scale of Snap, you're going to be doing setups that look a lot like cloud anyways: bringing new systems up either by cloning a disk or through PXE, setting up clustering (possibly with the stuff they're already using), etc. You're going to be writing a lot of the same fallover code if you're running on someone else's hardware, so why rent?
> And when EC2 falls over, like it tends to do a few times a year?
Multi-AZ, multi-region complete failures are very, very rare. How often do you get a failure in your data center per year (that you notice)?
> You're going to be writing a lot of the same fallover code if you're running on someone else's hardware, so why rent?
The answer is in the question -- when rented things fall down and go boom™, your code runs and someone gets a text message with the receipt.
When a handful of the "wrong" disks decide to revert to air-blocking bricks or your upstream network provider has an outage, you're lucky if it's something you can fix by heading to the data center. I promise that AWS or Google is better at running a DC, and unless you're trying to enter the hosting business, I wouldn't advise spending the time and money to meet their uptime and features.
I've only managed data storage in the scale of many petabytes (and this was a handful of years ago) and honestly, I think it required at least 20 hours a week of babysitting by various staff. At Snap's scale and traffic patterns (viral content, lots of writes, so on), I imagine this is a very non-trivial spend on scaling, staffing, tech implementation.
At 2bb over 5 years, maybe Snap would benefit from rolling their own -- hiring 50 great hackers at a mildly conservative 250k/head (say 200k average + benefits + taxes + employee support costs (HR, payroll, recruiting, legal, etc)), eating a year or two of transition costs off their cloud hosting providers, then probably saving a bit of money even after hardware, bandwidth, facility, insurance costs. Hell, maybe they'd even open source some software and recruiting would get easier after conference talks of how they did it. Or maybe they get bought by Google or Facebook in a year. Snap's in the business of selling ads and getting more eyes on those ads. Whatever enables growth and doesn't serve as a distraction or speedbump is a "fine" decision.
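To put rough numbers on the paragraph above, here is a back-of-the-envelope sketch that uses only the figures quoted there ($2bn over 5 years, 50 hires at $250k fully loaded). Hardware, bandwidth, facility, and insurance costs are not given anywhere in this thread, so the sketch does not guess at them; it only shows how much annual budget would be left for them before self-hosting costs more than the commitment.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the comment above.
CLOUD_COMMITMENT_USD = 2_000_000_000  # "2bb"
COMMITMENT_YEARS = 5
ENGINEERS = 50
COST_PER_ENGINEER_USD = 250_000       # fully loaded cost per head, per the comment


def annual_cloud_spend() -> float:
    """Average yearly spend implied by the commitment."""
    return CLOUD_COMMITMENT_USD / COMMITMENT_YEARS


def annual_staff_cost() -> int:
    """Yearly cost of the in-house team."""
    return ENGINEERS * COST_PER_ENGINEER_USD


def remaining_budget() -> float:
    """Yearly amount left for hardware, bandwidth, facilities, and insurance
    before self-hosting would cost more than the cloud commitment."""
    return annual_cloud_spend() - annual_staff_cost()


if __name__ == "__main__":
    print(f"cloud spend per year: ${annual_cloud_spend():,.0f}")
    print(f"staff cost per year:  ${annual_staff_cost():,.0f}")
    print(f"left for everything else: ${remaining_budget():,.0f}")
```

On these numbers the team is a small slice of the yearly spend, so whether rolling your own wins depends almost entirely on the non-staff costs and on the transition years.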
>Multi-AZ, multi-region complete failures are very, very rare. How often do you get a failure in your data center per year (that you notice)?
First, if you don't notice some random/unexpected EC2 instance failures, you don't have a big EC2 deployment. Even though there is a lot of pomp and circumstance around the cloud, when it comes down to it, your instances are still on a physical server in a datacenter somewhere and they can, and sometimes do, fail. In that case, as in every other robust production deployment, your application (hopefully) performs an automatic and graceful failover to its standbys. The location of the standbys is usually a configuration value. Not seeing any unique value proposition here for "the cloud".
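A minimal sketch of what "the location of the standbys is a configuration value" can look like in application code. The hostnames, the `CONFIG` layout, and the `pick_healthy` helper are all hypothetical, made up for illustration; a real deployment would plug in an actual health check (TCP connect, HTTP probe, replication-lag query, etc.).

```python
from typing import Callable, Optional, Sequence


def pick_healthy(endpoints: Sequence[str],
                 is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint that passes its health check, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated the same as a failed host.
            continue
    return None


# Hypothetical configuration: primary first, then standbys in preference order.
# Nothing here cares whether a host is colocated or rented.
CONFIG = {
    "endpoints": [
        "db1.colo.example",
        "db2.colo.example",
        "db1.cloud.example",
    ]
}

if __name__ == "__main__":
    down = {"db1.colo.example"}  # pretend the primary just died
    print(pick_healthy(CONFIG["endpoints"], lambda host: host not in down))
```

The selection logic is identical for owned and rented hosts; only the entries in the list change.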
The point is that even when you're using EC2, you still have to set all of that up. Contrary to popular belief, EC2 is not a panacea that can magically make your software reliable and redundant. It's just a nice interface that makes it easy to rent servers from Amazon.
The only benefit you get from EC2 is that someone paid by Amazon has to go pull the box, but your company could hire such a guy in-house for _much_ less than it's paying Amazon.
The onus is still on the developers to figure out all of the application stuff that's necessary to accommodate failover and make sure that everything plays nice with each other, and getting that working right is by far the most time-consuming part of deploying a high-availability application.
So EC2 doesn't add any extra resilience; it's just outsourcing the job of pulling a server to an Amazon employee/contractor instead of YourEmployer employee/contractor. If your company is big enough (and at Amazon's prices, you don't have to be very big at all to be "big enough"), that doesn't make sense.
I know EC2 et al are popular because people like buzzwords, but that doesn't make it good business (or does it? Investors love cloud because it keeps capex low, and because investors are buzzword-driven like everyone else; saying "cloud" will make them like you more and want to give you more money).
For companies that are still in the garage (literally in the garage), shelling out $20/mo for a couple of cheap VPSes from something like DigitalOcean is going to be just fine. But once you get bigger than that, there's no way to avoid paying attention to this stuff, even if paying Amazon tons of money creates a false psychological connection that makes you think they're doing the work for you.
>The answer is in the question -- when rented things fall down and go boom™, your code runs and someone gets a text message with the receipt.
Let me fix that for you: when things fall down and go boom, if your code is written and your deployment is configured to support it, your product continues to work, and someone, somewhere, has to get a broom and sweep up some ashes.
Whether or not cloud is a reasonable proposition is primarily a question of whether it makes more sense for that someone who sweeps up the ashes to be on the corporate payroll of YourEmployer or YourCloudProvider.
>I've only managed data storage in the scale of many petabytes (and this was a handful of years ago) and honestly, I think it required at least 20 hours a week of babysitting by various staff. At Snap's scale and traffic patterns (viral content, lots of writes, so on), I imagine this is a very non-trivial spend on scaling, staffing, tech implementation.
EC2 is not a silver bullet. It's just an interface to allow you to rent servers from Amazon. EC2 users still have to babysit stuff, just not the hardware (though they still have to monitor resource usage, clean up disk space, and be prepared for things to blink offline with 0 notice -- again, all the normal things; only difference is that your hardware jockey is accessed through EC2's web support interface instead of Slack/cell).
>At 2bb over 5 years, maybe Snap would benefit from rolling their own -- hiring 50 great hackers at a mildly conservative 250k/head (say 200k average + benefits + taxes + employee support costs (HR, payroll, recruiting, legal, etc))
Vastly overallocating here.
>Hell, maybe they'd even open source some software and recruiting would get easier after conference talks of how they did it.
Unnecessary, there's already tons of great open-source software to handle HA deployments (usually, this is the software underneath the commercial UI that makes everything work; it's surprising how much "revolutionary" commercial software is just glue code and a point-and-click wrapping around an OSS workhorse).
Of course, once you get unicorn-scale, everything has to go custom and/or highly modified because no out of the box solutions can handle the load, and that will be the case whether their hardware is hosted by Google or not. Again, "cloud" does very little to relieve workload for all non-hardware employees.
And the added benefit of being a trendy tech company is that after your company creates some extremely specialized solution, you can open-source it and watch with an uncomfortable mix of amusement and horror as 90%+ of other companies' tech departments contort themselves into pathetic, desperate architecture pretzels so that they can become cool by abandoning a stable, proven, mature stack for your company's experimental, sputtering, duct-taped abomination that requires a PhD to even get to compile.
This pattern has become so commonplace that reciting any specific example feels trite. You can probably name 12 off the top of your head. Hadoop in particular is a victim of many gross offenses of this type.
>Snap's in the business of selling ads and getting more eyes on those ads. Whatever enables growth and doesn't serve as a distraction or speedbump is a "fine" decision.
Sure, but they don't have to set massive gobs of money on fire for no reason along the way. But then, I guess they wouldn't be part of the Silicon Valley family if they didn't.
Snap is using App Engine, which transparently manages scale, availability, resiliency, deployment, and so forth. It's a higher level of service than EC2. Thus many of the valid concerns you describe do not apply to Snap, or are at least minimized.
> First, if you don't notice some random/unexpected EC2 instance failures, you don't have a big EC2 deployment.
The parent didn't claim they don't happen, just that (1) they are rare (a point you agree with, given the minimum usage needed to notice them) and (2) multi-AZ, multi-region failures are nearly non-existent.
> The point is that even when you're using EC2, you still have to set all of that up.
It takes literally minutes to set up an ELB and Autoscaling group across five availability zones. How long does the non-cloud version of that take?
> First, if you don't notice some random/unexpected EC2 instance failures, you don't have a big EC2 deployment. ...Not seeing any unique value proposition here for "the cloud".
Because when something fails, you don't have to care about the "why" as long as you can replace it. I see about 4 instances per 1000 needing maintenance per month. That's reasonable enough to not demand someone be full-time focused on making sure that only the good lights blink on the hardware.
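For a sense of scale, the rate quoted above (about 4 per 1000 instances per month) can be turned into per-fleet and per-year figures. This is a rough sketch: it assumes the rate is constant and that months are independent, neither of which is claimed in the comment, and the fleet size in the demo is a made-up example.

```python
MAINTENANCE_EVENTS = 4   # instances needing maintenance, per the comment
PER_INSTANCES = 1000     # ...out of this many instances
MONTHS_PER_YEAR = 12


def monthly_rate() -> float:
    """Chance that a given instance needs maintenance in a given month."""
    return MAINTENANCE_EVENTS / PER_INSTANCES


def expected_events_per_month(fleet_size: int) -> float:
    """Expected number of instances needing maintenance each month."""
    return fleet_size * MAINTENANCE_EVENTS / PER_INSTANCES


def yearly_probability_per_instance() -> float:
    """Chance that a given instance needs maintenance at least once in a year.

    Assumption: a constant monthly rate and independent months.
    """
    return 1 - (1 - monthly_rate()) ** MONTHS_PER_YEAR


if __name__ == "__main__":
    fleet = 5000  # hypothetical fleet size, not from the thread
    print(f"{expected_events_per_month(fleet):.1f} instances/month for {fleet} instances")
    print(f"{yearly_probability_per_instance():.1%} chance per instance per year")
```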
> The point is that even when you're using EC2, you still have to set all of that up. Contrary to popular belief, EC2 is not a panacea that can magically make your software reliable and redundant.
You're making a strawman by suggesting people think it's a panacea. The advantage is that a lot of the work, maintenance, and feature improvements for 'infrastructure as code' is handled for you. Cloud hosting means writing the software layer and being done, no managing the infrastructure services, facilities, hardware, business relationships involved with rack/stack.
> It's just a nice interface that makes it easy to rent servers from Amazon.
To be fair, it's a _very_ nice interface.
> I know EC2 et al are popular because people like buzzwords, but that doesn't make it good business (or does it? Investors love cloud because it keeps capex low, and because investors are buzzword-driven like everyone else; saying "cloud" will make them like you more and want to give you more money).
If you think cloud hosting is popular because of op-ex or buzzwords, I think you're out of touch. EC2 and Google Cloud are popular because they let you focus on getting shit done, even when you have variable workloads that are uptime-dependent.
> For companies that are still in the garage (literally in the garage), shelling out $20/mo for a couple of cheap VPSes from something like DigitalOcean is going to be just fine. But once you get bigger than that, there's no way to avoid paying attention to this stuff, even if paying Amazon tons of money creates a false psychological connection that makes you think they're doing the work for you.
They _are_ doing a lot of work for you. You say $20 is the point that it makes more sense to self-host. I'll be charitable and round that up to $100, but even at that price, there is _no way_ you'll be able to get something as fault tolerant or low-cost as a cloud hosted solution. Do you really think that for $100 a month you can self-host geo-close servers with redundancy to the point that you don't have to think about it? Keep in mind that "two is one and one is none" when planning your hardware purchase.
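The "two is one and one is none" rule has simple arithmetic behind it. A rough sketch, assuming each replica is down some fraction of the time and that replicas fail independently; real hardware often shares power, switches, or an upstream provider, so treat this as the best case. The 1% downtime figure in the demo is a hypothetical number, not a measurement from this thread.

```python
def availability(per_node_downtime: float, replicas: int) -> float:
    """Fraction of time at least one replica is up.

    Assumption: replicas fail independently of each other.
    """
    if not 0.0 <= per_node_downtime <= 1.0:
        raise ValueError("per_node_downtime must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The service is down only when every replica is down at the same time.
    return 1.0 - per_node_downtime ** replicas


if __name__ == "__main__":
    downtime = 0.01  # hypothetical: each box is down 1% of the time
    for n in (1, 2, 3):
        print(f"{n} replica(s): {availability(downtime, n):.6f}")
```

Each extra independent replica multiplies the downtime by the per-node figure, which is why the second box buys far more than its share of the hardware bill suggests, and also why it has to be in the budget from the start.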
> Vastly overallocating here.
No, that's conservative for a major US city (e.g. where Snap would be doing the hiring). Have you tried to pull a handful of really good system hackers out of thin air recently? Even if you can get them, they're not cheap, and you'd need a sizable team to pull off the highly-redundant world-wide install that Snap needs for its growth projections. It starts off expensive to hire good tech and gets more spendy the longer you're fishing.
And that's even ignoring the costs on productivity (for that and other employees) when an employee isn't happy or decides it's time to leave -- staffing also takes money and attention to maintain.
> And the added benefit of being a trendy tech company is that after your company creates some extremely specialized solution, you can open-source it and watch with an uncomfortable mix of amusement and horror as 90%+ of other companies' tech departments contort themselves into pathetic, desperate architecture pretzels so that they can become cool by abandoning a stable, proven, mature stack for your company's experimental, sputtering, duct-taped abomination that requires a PhD to even get to compile.
You seem like you're speaking from personal experience. Having a working infrastructure that isn't a barrier to growth isn't trendy or sexy, it's a base competency for any internet-reliant business model.
> Sure, but they don't have to set massive gobs of money on fire for no reason along the way. But then, I guess they wouldn't be part of the Silicon Valley family if they didn't.
This isn't setting "massive gobs of money on fire for no reason", this is going with a high-performance datacenter that someone else maintains. They clearly have something very big in mind and I doubt they made a multi-$bb commitment without asking themselves "are we lighting this money on fire?"
> Of course, maybe they got some killer promotional deal with Google
For sure. How much is this free marketing that Google cloud service is getting worth? I'm pretty sure whatever discounted deal Google gave Snap is more than made up by this free marketing blitz they're getting.
Snap’s been a happy and public customer for some time, so any “free marketing blitz” would a) have essentially been used up before, and b) would truly have to be remarkable to work against some form of discount where the non-discounted remainder /still/ represents $2,000,000,000 over five years.
First of all, the extent of Snap's dependence on Google Cloud and this extreme volume of spend was never public.
Also, there's a difference between something being public (like press releases) and actively generating buzz where lots of (relevant) people are actively talking about this.
I think if you were in a GCE sales meeting yesterday you'd have noticed a lot of people jumping up and down in joy. They've been playing second fiddle to AWS and in desperate catchup mode. Their next cold call got so much easier. Their next close got so much easier. Screw all that, their inbounds suddenly went through the roof. Lots of smaller startups etc. who would have never thought of Google cloud as an option are now seriously considering it. A lot of people who are already on AWS just signed up for GCE out of curiosity "just to see what the big deal is about". I don't think there's any way to overstate the impact of this news on Google Cloud's future.