|
> And when EC2 falls over, like it tends to do a few times a year? Multi-AZ, multi-region complete failures are very, very rare. How often do you get a failure in your data center per year (that you notice)? > You're going to be writing a lot of the same fallover code if you're running on someone else's hardware, so why rent? The answer is in the question -- when rented things fall down and go boom™, your code runs and someone gets a text message with the receipt. When a handful of the "wrong" disks decide to revert to air-blocking bricks or your upstream network provider has an outage, you're lucky if it's something you can fix by heading to the data center. I promise that AWS or Google is better at running a DC, and unless you're trying to enter the hosting business, I wouldn't advise spending the time and money to meet their uptime and features. I've only managed data storage in the scale of many petabytes (and this was a handful of years ago) and honestly, I think it required at least 20 hours a week of babysitting by various staff. At Snap's scale and traffic patterns (viral content, lots of writes, so on), I imagine this is a very non-trivial spend on scaling, staffing, tech implementation. At 2bb over 5 years, maybe Snap would benefit from rolling their own -- hiring 50 great hackers at a mildly conservative 250k/head (say 200k average + benefits + taxes + employee support costs (HR, payroll, recruiting, legal, etc)), eating a year or two of transition costs off their cloud hosting providers, then probably saving a bit of money even after hardware, bandwidth, facility, insurance costs. Hell, maybe they'd even open source some software and recruiting would get easier after conference talks of how they did it. Or maybe they get bought by Google or Facebook in a year. Snap's in the business of selling ads and getting more eyes on those ads. Whatever enables growth and doesn't serve as a distraction or speedbump is a "fine" decision. |
First, if you don't notice some random/unexpected EC2 instance failures, you don't have a big EC2 deployment. Even though there is a lot of pomp and circumstance around the cloud, when it comes down to it, your instances are still on a physical server in a datacenter somewhere and they can, and sometimes do, fail. In that case, as in every other robust production deployment, your application (hopefully) performs an automatic and graceful failover to its standbys. The location of the standbys is usually a configuration value. Not seeing any unique value proposition here for "the cloud".
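To make the "standbys are just configuration" point concrete, here's a minimal Python sketch of client-side failover. Everything in it is made up for illustration: the hostnames, the `CONFIG` dict, `call_with_failover`, and the fake request function are assumptions, not anything from a real deployment or from EC2's API. Note that nothing in it cares who owns the boxes.

```python
from typing import Callable, Optional, Sequence


class AllEndpointsFailed(Exception):
    """Raised when the primary and every configured standby have failed."""


def call_with_failover(endpoints: Sequence[str],
                       request: Callable[[str], str]) -> str:
    """Try each endpoint in order; return the first successful response."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except ConnectionError as exc:
            # This endpoint is unreachable; fall through to the next one.
            last_error = exc
    raise AllEndpointsFailed(
        f"all {len(endpoints)} endpoints failed") from last_error


# Hypothetical configuration: the hosts could be EC2 instances or machines
# in your own rack. The failover code is identical either way.
CONFIG = {
    "endpoints": [
        "primary.internal",
        "standby-1.internal",
        "standby-2.internal",
    ]
}


def fake_request(endpoint: str) -> str:
    """Stand-in for a real network call; simulates a dead primary."""
    if endpoint == "primary.internal":
        raise ConnectionError("primary is down")
    return f"served by {endpoint}"


result = call_with_failover(CONFIG["endpoints"], fake_request)
print(result)  # served by standby-1.internal
```

A real deployment would also need health checks, timeouts, and state replication, which is exactly the application-level work that renting the hardware doesn't do for you.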
The point is that even when you're using EC2, you still have to set all of that up. Contrary to popular belief, EC2 is not a panacea that can magically make your software reliable and redundant. It's just a nice interface that makes it easy to rent servers from Amazon.
The only benefit you get from EC2 is that someone paid by Amazon has to go pull the box, but your company could hire such a guy in-house for _much_ less than it's paying Amazon.
The onus is still on the developers to figure out all of the application stuff that's necessary to accommodate failover and make sure that everything plays nice with each other, and getting that working right is by far the most time-consuming part of deploying a high-availability application.
So EC2 doesn't add any extra resilience; it's just outsourcing the job of pulling a server to an Amazon employee/contractor instead of YourEmployer employee/contractor. If your company is big enough (and at Amazon's prices, you don't have to be very big at all to be "big enough"), that doesn't make sense.
I know EC2 et al are popular because people like buzzwords, but that doesn't make it good business (or does it? Investors love cloud because it keeps capex low, and because investors are buzzword-driven like everyone else; saying "cloud" will make them like you more and want to give you more money).
For companies that are still in the garage (literally in the garage), shelling out $20/mo for a couple of cheap VPSes from something like DigitalOcean is going to be just fine. But once you get bigger than that, there's no way to avoid paying attention to this stuff, even if paying Amazon tons of money creates a false psychological connection that makes you think they're doing the work for you.
>The answer is in the question -- when rented things fall down and go boom™, your code runs and someone gets a text message with the receipt.
Let me fix that for you: when things fall down and go boom, if your code is written and your deployment is configured to support it, your product continues to work, and someone, somewhere, has to get a broom and sweep up some ashes.
Whether or not cloud is a reasonable proposition is primarily a question of whether it makes more sense for that someone who sweeps up the ashes to be on the corporate payroll of YourEmployer or YourCloudProvider.
>I've only managed data storage in the scale of many petabytes (and this was a handful of years ago) and honestly, I think it required at least 20 hours a week of babysitting by various staff. At Snap's scale and traffic patterns (viral content, lots of writes, so on), I imagine this is a very non-trivial spend on scaling, staffing, tech implementation.
EC2 is not a silver bullet. It's just an interface to allow you to rent servers from Amazon. EC2 users still have to babysit stuff, just not the hardware (though they still have to monitor resource usage, clean up disk space, and be prepared for things to blink offline with 0 notice -- again, all the normal things; only difference is that your hardware jockey is accessed through EC2's web support interface instead of Slack/cell).
>At 2bb over 5 years, maybe Snap would benefit from rolling their own -- hiring 50 great hackers at a mildly conservative 250k/head (say 200k average + benefits + taxes + employee support costs (HR, payroll, recruiting, legal, etc))
Vastly overallocating here.
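For a sense of proportion, here's the quoted estimate as back-of-envelope arithmetic. It assumes "2bb" means $2 billion and uses only the headcount and per-head figures from the quote; hardware, bandwidth, facility, and transition costs are deliberately left out, so this is the staffing line alone and not a full comparison.

```python
# Figures taken from the quoted comment; "2bb" is assumed to mean $2 billion.
HEADCOUNT = 50
COST_PER_HEAD = 250_000        # fully loaded annual cost per engineer, USD
YEARS = 5
CLOUD_SPEND = 2_000_000_000    # quoted cloud commitment over the same 5 years

staffing_total = HEADCOUNT * COST_PER_HEAD * YEARS
share = staffing_total / CLOUD_SPEND

print(f"{staffing_total:,}")   # 62,500,000
print(f"{share:.3%}")          # 3.125%
```

So even at the quoted headcount, staffing comes to a few percent of the cloud bill, and a smaller team only shrinks it further.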
>Hell, maybe they'd even open source some software and recruiting would get easier after conference talks of how they did it.
Unnecessary; there's already tons of great open-source software to handle HA deployments (usually, this is the software underneath the commercial UI that makes everything work; it's surprising how much "revolutionary" commercial software is just glue code and a point-and-click wrapping around an OSS workhorse).
Of course, once you get to unicorn scale, everything has to go custom and/or highly modified because no out-of-the-box solution can handle the load, and that will be the case whether their hardware is hosted by Google or not. Again, "cloud" does very little to relieve workload for all non-hardware employees.
And the added benefit of being a trendy tech company is that after your company creates some extremely specialized solution, you can open-source it and watch with an uncomfortable mix of amusement and horror as 90%+ of other companies' tech departments contort themselves into pathetic, desperate architecture pretzels so that they can become cool by abandoning a stable, proven, mature stack for your company's experimental, sputtering, duct-taped abomination that requires a PhD to even get to compile.
This pattern has become so commonplace that reciting any specific example feels trite. You can probably name 12 off the top of your head. Hadoop in particular is a victim of many gross offenses of this type.
>Snap's in the business of selling ads and getting more eyes on those ads. Whatever enables growth and doesn't serve as a distraction or speedbump is a "fine" decision.
Sure, but they don't have to set massive gobs of money on fire for no reason along the way. But then, I guess they wouldn't be part of the Silicon Valley family if they didn't.