by collaborative 1023 days ago
Guessing affected customers had to spend time and effort on top of ongoing high cloud bills

I've slept so much better since I began hosting, producing energy, and cooling on-prem

2 comments

The secret is hosting across failure boundaries so that a single outage like this does not impact you. Self-hosting is fine if you can afford the capex for two physically separate data centers (like really separate - like 100+ miles etc (or more!) to cope with natural disasters) and the staff to operate & maintain them 24/7. For many, this is not realistic.

For those that do need to use cloud, just make sure you are running your services in different failure zones.

> For those that do need to use cloud, just make sure you are running your services in different failure zones.

By which time you might as well just roll out your own kit in colocation or your own datacentres.

The cloud providers are nickel-and-dimers; they charge you for every little tiny thing.

Cloud might look cheap at cents-per-hour, but then you find you need X "services" to deliver your Service, and so you are really paying a multiple of that rate (X cloud services times x cents-per-hour).

And then running your services across failure zones will of course cost you more beyond the basic double-cost, because most cloud providers charge by the GB for cross-zone traffic. So if you're doing cross-zone replication, that's gonna cost you a pretty penny.
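To make the arithmetic in the last two paragraphs concrete, here is a rough sketch. Everything in it is a placeholder: the function name, the hourly rates, and the per-GB fee are made up for illustration and are not any provider's actual pricing.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)

def monthly_cloud_cost(hourly_rates, zones=1, cross_zone_gb=0.0, per_gb_fee=0.0):
    """Estimate a monthly bill in dollars (all inputs are hypothetical).

    hourly_rates:  hourly price of each service the product depends on.
    zones:         number of failure zones the whole stack is duplicated into.
    cross_zone_gb: GB replicated between zones per month.
    per_gb_fee:    price charged per GB of cross-zone traffic.
    """
    compute = sum(hourly_rates) * HOURS_PER_MONTH * zones
    transfer = cross_zone_gb * per_gb_fee
    return compute + transfer

# Three made-up services in one zone, then the same stack in two zones
# with 5 TB/month of replication traffic between them.
single = monthly_cloud_cost([0.10, 0.05, 0.02])
dual = monthly_cloud_cost([0.10, 0.05, 0.02], zones=2,
                          cross_zone_gb=5000, per_gb_fee=0.01)
print(f"single zone: ${single:.2f}/month, two zones: ${dual:.2f}/month")
```

With these placeholder numbers the two-zone bill is the doubled compute cost plus the transfer fee on top, which is the "more than the basic double-cost" effect described above.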

Meanwhile, in your own colo/DC, you have predictable costs. And you can get redundant connections between sites for a flat rate, not some stupid per GB fee.

>like 100+ miles etc (or more!) to cope with natural disasters)

People talk about this often but this failure mode seems to never happen? When was the last time us-east-1 went down because of a natural calamity compared to some technical issue?

Not sure about us-east-1 specifically, but there are frequently fairly large natural disasters in the US - there are always hurricanes and stuff, there was that flooding in New York not so long ago, earthquakes in California in the 90s, wildfires etc. And this is just in the US. Basically, don't put all your servers in NYC or all in SF or whatever, but put half in NYC and half in SF and that random hurricane/wildfire/flood/snowstorm etc won't take out both of your data centers.

Of course, then you have latency issues to think about, but that is often quite application-specific and potentially a good problem to have if a slightly slow website or database or whatever is the biggest problem you have when the alternative would have been a total shutdown.

There are also occasional fires and stuff that take out a whole building (I think OVH had this in France recently?). Ensure that your failure zones are physically separate places, and not just logically-separate zones in the same physical building, or in a building that is next to the one on fire :)
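One way to check the "physically separate" part is to compute the great-circle distance between candidate sites. The sketch below uses the haversine formula; the 100-mile threshold comes from the rule of thumb earlier in the thread, and the function names and the approximate city-centre coordinates are just illustrative.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def far_enough(site_a, site_b, min_miles=100):
    """True if two (lat, lon) sites are at least min_miles apart."""
    return distance_miles(*site_a, *site_b) >= min_miles

nyc = (40.7128, -74.0060)    # approximate city-centre coordinates
sf = (37.7749, -122.4194)
print(round(distance_miles(*nyc, *sf)), far_enough(nyc, sf))
```

Distance alone does not prove independence (two sites can still share a carrier or a power grid), so treat this as a first filter only.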

>but there are frequently fairly large natural disasters in the US - there are always hurricanes and stuff, there was that flooding in new York not so long ago, earthquakes in California in the 90s, wildfires etc.

Right but what type of datacenter related incidents did they cause? Did us-east-1 go down because of hurricane sandy? Did us-west-1 go down because of wildfires? I don't seem to remember any datacenter outages caused by wide area natural disasters, whereas I can remember plenty caused by BGP/DNS/config shenanigans.

> Did us-east-1 go down because of hurricane sandy?

Nope, but Sandy did a hell of a lot of damage to some key telecommunications infrastructure. Verizon lost multiple floors worth of equipment, cabling, and related infrastructure that served at least their customers across Manhattan.

Having geographical redundancy for mission critical workloads is a good investment if your business is making money. Networked computing is one of the few places we can actually “run away” from a physical source of problems. (Not forever, or universally, of course).

We’re based on the eastern seaboard. You bet we have failsafes in areas less susceptible to natural disaster.

> Did us-east-1 go down because of hurricane sandy?

No, but I was at a company with all the production services in Reston, VA during that storm, and we would have been pretty screwed if Sandy made landfall in the DC area instead of continuing north.

Sandy's flooding in NYC wasn't great for some of the datacenters there, I seem to recall some having trouble, but most were fine.

BGP and DNS are certainly much better at causing disruption, and especially global disruption though.

I remember Hurricane Katrina shutting down lots of online services, and directnic battling to stay online https://www.datacenterknowledge.com/archives/2007/11/05/prov...

Fully agree on this, plus (a very important plus) test that losing an AZ doesn't bring the services in the good AZ down too. And test this frequently.

I would be very, very surprised if the companies mentioned, in particular banks, weren't running on multiple AZs, but I wouldn't be surprised if the scenario of losing an AZ was never tested.
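A toy model of why this test matters: a service can have replicas in two zones and still go down with one of them if it depends on something that lives in a single zone. The sketch below is only an illustration; the service names, zone names, and the function are hypothetical, and it is no substitute for actually pulling the plug on a zone.

```python
def surviving_services(placement, depends_on, failed_zone):
    """Return the set of services still usable after one zone fails.

    placement:  service name -> set of zones it has replicas in.
    depends_on: service name -> set of services it needs in order to work.
    """
    # A service is a candidate if it has at least one replica outside the failed zone.
    up = {s for s, zones in placement.items() if zones - {failed_zone}}
    # Repeatedly drop services whose dependencies are not all up,
    # until nothing changes (failures cascade through the dependency graph).
    changed = True
    while changed:
        changed = False
        for s in list(up):
            if not depends_on.get(s, set()) <= up:
                up.remove(s)
                changed = True
    return up

placement = {
    "web": {"az-a", "az-b"},
    "api": {"az-a", "az-b"},
    "sessions": {"az-a"},      # hidden single-zone dependency
}
depends_on = {"web": {"api"}, "api": {"sessions"}}

after_a = surviving_services(placement, depends_on, "az-a")
after_b = surviving_services(placement, depends_on, "az-b")
print("lose az-a:", sorted(after_a))
print("lose az-b:", sorted(after_b))
```

Here "web" and "api" are nominally multi-AZ, yet losing az-a takes everything down because "sessions" only runs there, while losing az-b is harmless.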

What about data center colocation? When you simply rent the energy, cooling, etc, but the hardware is yours? Do you think it's a nice middle ground?

> Do you think it's a nice middle ground?

It is.

The cloud fanbois will tell you until they're blue in the face that it's not.

I fully accept that the cloud is great for bursty workloads where you're doing nothing and then suddenly half the planet needs your service for a couple of days. That is clear.

But if you've got a reasonably stable baseload running 24x7x365 and a few modest bursts here and there, then honestly people need to do the math, because if you look beyond the short-term figures, for example at a three-year period, the cloud tends to work out much more expensive than colo.
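"Doing the math" can be as simple as the sketch below. The function names and every dollar figure are hypothetical placeholders; plug in your own quotes, and remember to include staff time and hardware refresh on the colo side.

```python
import math

def total_cost(upfront, monthly, months=36):
    """Total spend over a period: one-off capex plus recurring monthly charges."""
    return upfront + monthly * months

def break_even_month(colo_upfront, colo_monthly, cloud_monthly):
    """First whole month at which cumulative colo spend is no higher than cloud spend.

    Returns None if colo never catches up (its monthly cost is not lower).
    """
    if colo_monthly >= cloud_monthly:
        return None
    saving_per_month = cloud_monthly - colo_monthly
    return math.ceil(colo_upfront / saving_per_month)

# Made-up example: $60k of hardware plus $2k/month colo fees,
# versus a $5k/month cloud bill for the same steady baseload.
cloud_3y = total_cost(upfront=0, monthly=5000)
colo_3y = total_cost(upfront=60000, monthly=2000)
print(cloud_3y, colo_3y, break_even_month(60000, 2000, 5000))
```

The break-even month is the useful number: if it falls well inside the period you expect to run the workload, colo deserves a serious look.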

Most people don't need the scale the cloud gives. They think they do, but really most people will never grow to FANG scale, as much as they may dream of it!

I believe the real secret reason the cloud is so popular among developers (based on 10+ years of experience) is that cloud providers are so much nicer and faster to deal with than your company IT department.

Also on the price side, I'm not comparing the price of cloud vs colo, but the price of cloud vs what the company IT department charges my department for being allowed to use one of 'their' colo servers, and that is many times what a cloud server costs. (as a real world example, the place I used to work internally invoiced $150/server/month for a virtual server that would cost me $20/server/month on AWS before any discounts).

Cloud lives not by competing against smart people running their own servers, but against inefficient internal IT services, and there they have them beat both on price and quality.

I'm the ops guy on a small dev team, and I run a sort of hybrid setup for prod that does involve me working on hardware in a colo sometimes, though fairly rarely (I'd love to spend about half my time hauling servers around and cabling stuff so that I'm not stuck at a desk all day, but that's not the way it is).

The whole point of my job is to enable developers to deliver code that provides customers value. On that level I actually embrace the common "condescensions" (so-to-speak) that I'm tech support for developers or a YAML wrangler.

I actually had an experience recently where a developer asked to make some changes to our infrastructure. I pretty much developed our container orchestration system (based on Docker Swarm rather than Kubernetes - a choice our architect made that I've come to appreciate), so I walked him through how my IaC works, told him what he needed to change and then reviewed his pull request and applied the changes. I guess we're on a devops journey now if I want to put it in corpo-tech speak.

Anyway, I suppose a lot of IT departments/guys get lost in creating their "perfect" unassailable systems and forget the big picture: the job is to enable customers. Most directly, those are likely to be the developers or other internal employees, but ultimately it's the end customer who's handing you money to solve their problems.

Also the vast array of managed services. Managed databases, message queues, infinite storage, data warehouses, caches, etc etc. Many of which are very complicated to host well yourself and operationalize (failovers, monitoring, backups etc).

This idea that you can build a DC that competes on cost for rented cloud compute - it might be technically true but it’s mostly missing the point of why modern shops prefer the cloud.

> Many of which are very complicated to host well yourself and operationalize (failovers, monitoring, backups etc)

Oh you are hilarious.

Time for your daily reminder that failovers, monitoring and backups DO NOT EXIST in the cloud UNLESS (a) you configure and manage them (b) deploy your services in multiple zones (and spend $$$$$ along the way).

Lots of people cannot do (a) properly and it is regularly demonstrated by AWS US-East-1 and others that not many people do (b) fully, or in many cases, don't do it at all.

So yeah, the cloud is still "complicated", it's just a different sort of complicated. And if you do failovers, monitoring and backups properly, the cloud is still "expensive", it's just a different sort of expensive.

Hey, maybe avoid the patronizing crap? I've been involved in running at-scale properties, both in the cloud and not, for 20 years or so now, so whilst I don't know everything, I do somewhat know what I'm talking about.

Making an RDS Postgres instance multi-AZ with automatic failover and bulletproof backups to S3 is a matter of ticking a couple of boxes. Compared to building all of that yourself at the same level of uptime, it's not complexity of the same magnitude at all. And sure, it will cost you more for the redundant instances, but it's pretty easily worth it: I don't have to pay an ops team to babysit my databases. That's just Postgres, not even getting into things like Aurora, Dynamo, Kinesis, SQS, Lambda: things that either don't have a self-hosted equivalent at all, or if they do, are way more complicated to run at scale than PG.
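For a sense of what "ticking a couple of boxes" looks like through the API instead of the console, here is a minimal sketch. It only builds the keyword arguments for boto3's RDS `create_db_instance` call; `MultiAZ` and `BackupRetentionPeriod` are real parameters of that call, while the helper name, the instance identifier, the instance class, and the storage size are placeholders. Credentials, networking, and security settings are deliberately left out.

```python
def multi_az_postgres_settings(identifier, retention_days=7):
    """Build keyword arguments for boto3's RDS create_db_instance call.

    Illustrative helper only: sizes and names are placeholders, and the
    master credentials and network configuration are intentionally omitted.
    """
    return {
        "DBInstanceIdentifier": identifier,
        "Engine": "postgres",
        "DBInstanceClass": "db.t3.medium",        # placeholder instance size
        "AllocatedStorage": 20,                   # GiB, placeholder
        "MultiAZ": True,                          # standby replica in another AZ
        "BackupRetentionPeriod": retention_days,  # days of automated backups
    }

settings = multi_az_postgres_settings("example-db")
print(settings)

# With boto3 installed and AWS credentials configured, these settings would be
# passed (together with the omitted credentials) roughly as:
#   boto3.client("rds").create_db_instance(**settings, ...)
```

The two lines that matter are `MultiAZ` and `BackupRetentionPeriod`; as the replies above point out, you still have to set them, pay for them, and test that failover actually works.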

In some cases it's trading cloud costs for personnel costs. Both opex. But in many others it's having access to services, datastores etc. that I couldn't otherwise have as a dev.

I see. But the cloud is much more than just VMs. I don't know if I'd want to manage an equivalent to SQS on my own. Maybe I should try it out and see what happens :)