by mattlondon 1025 days ago
The secret is hosting across failure boundaries so that a single outage like this does not impact you. Self-hosting is fine if you can afford the capex for two physically separate data centers (like really separate - like 100+ miles etc (or more!) to cope with natural disasters) and the staff to operate & maintain them 24/7. For many, this is not realistic.

For those that do need to use cloud, just make sure you are running your services in different failure zones.

3 comments

> For those that do need to use cloud, just make sure you are running your services in different failure zones.

By which time you might as well just roll out your own kit in colocation or your own datacentres.

The cloud providers are nickel-and-dimers; they charge you for every tiny little thing.

Cloud might look cheap at cents-per-hour, but then you find you need X "services" to deliver your Service, so you are really paying a multiple of that rate (X cloud services times x cents-per-hour).

And then running your services across failure zones will of course cost you more beyond the basic double-cost, because most cloud providers charge by the GB for cross-zone traffic. So if you're doing cross-zone replication, that's gonna cost you a pretty penny.

Meanwhile, in your own colo/DC, you have predictable costs. And you can get redundant connections between sites for a flat rate, not some stupid per GB fee.
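To put rough numbers on that, here is a back-of-the-envelope sketch of the arithmetic. Every figure in it (service count, hourly rate, per-GB fee, the 730-hour month) is a made-up placeholder, not any provider's actual pricing, and `monthly_cloud_cost` is just a name I picked for illustration.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def monthly_cloud_cost(num_services, cents_per_hour, zones=1,
                       cross_zone_gb=0.0, cents_per_gb=0.0):
    """Rough monthly bill in dollars (hypothetical model, not real pricing).

    Compute cost scales with services x hourly rate x zones; cross-zone
    traffic is billed per GB on top of that.
    """
    compute_cents = num_services * cents_per_hour * HOURS_PER_MONTH * zones
    transfer_cents = cross_zone_gb * cents_per_gb
    return (compute_cents + transfer_cents) / 100.0


# One service in one zone vs. five services in two zones with replication.
single = monthly_cloud_cost(num_services=1, cents_per_hour=10)
redundant = monthly_cloud_cost(num_services=5, cents_per_hour=10, zones=2,
                               cross_zone_gb=1000, cents_per_gb=2)
print(f"single zone: ${single:.2f}, two zones: ${redundant:.2f}")
```

The point is only the shape of the bill: the zone count multiplies the compute line, and the per-GB line grows with replication volume, which is the part a flat-rate inter-site link avoids.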

>like 100+ miles etc (or more!) to cope with natural disasters)

People talk about this often but this failure mode seems to never happen? When was the last time us-east-1 went down because of a natural calamity compared to some technical issue?

Not sure about us-east-1 specifically but there are frequently fairly large natural disasters in the US - there are always hurricanes and stuff, there was that flooding in New York not so long ago, earthquakes in California in the 90s, wildfires etc. And this is just in the US. Basically, don't put all your servers in NYC or all in SF or whatever, but put half in NYC and half in SF and that random hurricane/wildfire/flood/snowstorm etc won't take out both of your data centers.

.... Of course then you have latency issues to think about, but that is often quite application-specific and potentially a good problem to have if a slightly slow website or database or whatever is the biggest problem you have when the alternative would have been a total shutdown.

There are also occasional fires and stuff that take out a whole building (I think OVH had this in France recently?). Ensure that your failure zones are physically separate places, and not just logically-separate zones in the same physical building, or in a building that is next to the one on fire :)
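If you want to sanity-check the "physically separate" part, a great-circle distance between candidate sites is a quick first pass. This is a minimal sketch using the standard haversine formula; the coordinates are approximate city centers I filled in for illustration, the 100-mile threshold just echoes the figure mentioned earlier in the thread, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def far_enough(site_a, site_b, min_miles=100.0):
    """True if two (lat, lon) sites are at least min_miles apart."""
    return miles_between(*site_a, *site_b) >= min_miles


# Approximate coordinates, for illustration only.
NYC = (40.7128, -74.0060)
SF = (37.7749, -122.4194)
MIDTOWN = (40.7580, -73.9855)  # a second site in the same city

print(far_enough(NYC, SF))       # coast to coast
print(far_enough(NYC, MIDTOWN))  # same metro area
```

Distance alone says nothing about shared power, carriers, or flood plains, so treat it as a filter rather than proof of independence.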

>but there are frequently fairly large natural disasters in the US - there are always hurricanes and stuff, there was that flooding in New York not so long ago, earthquakes in California in the 90s, wildfires etc.

Right but what type of datacenter related incidents did they cause? Did us-east-1 go down because of hurricane sandy? Did us-west-1 go down because of wildfires? I don't seem to remember any datacenter outages caused by wide area natural disasters, whereas I can remember plenty caused by BGP/DNS/config shenanigans.

> Did us-east-1 go down because of hurricane sandy?

Nope, but Sandy did a hell of a lot of damage to some key telecommunications infrastructure. Verizon lost multiple floors worth of equipment, cabling, and related infrastructure that served at least their customers across Manhattan.

Having geographical redundancy for mission critical workloads is a good investment if your business is making money. Networked computing is one of the few places we can actually “run away” from a physical source of problems. (Not forever, or universally, of course).

We’re based on the eastern seaboard. You bet we have failsafes in areas less susceptible to natural disaster.

> Did us-east-1 go down because of hurricane sandy?

No, but I was at a company with all the production services in Reston, VA during that storm, and we would have been pretty screwed if Sandy made landfall in the DC area instead of continuing north.

Sandy's flooding in NYC wasn't great for some of the datacenters there, I seem to recall some having trouble, but most were fine.

BGP and DNS are certainly much better at causing disruption, and especially global disruption though.

I remember Hurricane Katrina shutting down lots of online services, and directnic battling to stay online https://www.datacenterknowledge.com/archives/2007/11/05/prov...

Fully agree on this, plus (a very important plus) test that severing an AZ doesn't bring down the services in the healthy AZ too. And test this frequently.

I would be very, very surprised if the companies mentioned, in particular banks, weren't running on multiple AZs, but I wouldn't be surprised if the scenario of severing an AZ had never been tested.
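As a toy illustration of what such a test is looking for, here is a minimal sketch that models services with hard dependencies and checks what survives when one zone is cut off. The `Service` class, the service names, and the "hard dependency on one specific instance" model are all simplifying assumptions of mine; a real test would sever the zone in a live or staging environment rather than in a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    zone: str
    depends_on: list = field(default_factory=list)  # names of required services


def healthy_after_outage(services, failed_zone):
    """Return the names of services still healthy once failed_zone is severed.

    A service survives only if it is outside the failed zone and every
    service it depends on also survives (checked until nothing changes).
    """
    by_name = {s.name: s for s in services}
    healthy = {s.name for s in services if s.zone != failed_zone}
    changed = True
    while changed:
        changed = False
        for name in list(healthy):
            if any(dep not in healthy for dep in by_name[name].depends_on):
                healthy.discard(name)
                changed = True
    return healthy


# Hypothetical deployment: web-b quietly talks to the database in zone "a".
services = [
    Service("db-a", "a"),
    Service("db-b", "b"),
    Service("web-a", "a", ["db-a"]),
    Service("web-b", "b", ["db-a"]),  # hidden cross-zone dependency
]
print(healthy_after_outage(services, failed_zone="a"))
```

In this made-up layout, losing zone "a" also takes out `web-b`, even though it runs in the "good" zone, which is exactly the kind of surprise that only shows up when you actually run the test.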