by YetAnotherNick 2146 days ago
You wrote about one server but describe the failure modes of having one data center. I think it is very uncommon, and hard, to allow for data-center-level issues. After all, Instagram and 100 other sites failed when one AWS data center went down. I would be interested to know how, or whether, anyone's backend keeps working if a data center and its databases fail completely due to fire, earthquake, networking, etc.

The second thing is having multiple machines for the server. In theory it might help increase availability, but in practice I haven't seen a random machine issue that occurs purely by chance. I think almost all failure modes that exist are correlated between machines. E.g., suppose you have data loss on one machine: more likely than not you could blame it on code, and it would be similar across machines.


Re: single datacenter. At the basic level, you need a second datacenter with enough machines to provide your service (or an emergency version at least), replication of data, and a way to switch traffic. It's doable, but expensive in capital and development. If you're dependent on outsourced services, they also need to be available from both datacenters and not served from only one. In an ideal world, your two datacenters would be managed by different companies, so you would avoid any one company's global routing failure (IBM had one recently).

Re: multiple servers. Power supplies fail, memory modules fail, CPUs fail, fans fail, storage drives fail. Sometimes those are correlated: the HP SSDs that failed when the power-on hours hit a limit (two separate models) are going to be pretty correlated if they were purchased new, put into servers at a similar time, and then left on 24/7. Most of those failures aren't that correlated, though. Software failures are more likely to be correlated, of course.
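To put rough numbers on the correlated vs. uncorrelated distinction, here's a minimal sketch. The function name and both probabilities are made up for illustration, and it assumes a very simple model: one shared-cause event that takes every machine down together, plus independent per-machine hardware faults.

```python
def all_down_probability(p_single: float, n: int, p_common: float = 0.0) -> float:
    """Chance that all n machines are down at the same time.

    p_single: assumed chance one machine is down from an independent fault
              (power supply, memory, fan, drive).
    p_common: assumed chance of a shared-cause event (bad deploy, same-batch
              SSD firmware bug) that takes every machine down together.
    """
    # With no shared cause, every machine has to fail on its own.
    independent_all_down = p_single ** n
    # Either the shared cause hits, or it doesn't and all n fail independently.
    return p_common + (1 - p_common) * independent_all_down


# Hypothetical numbers: 1% independent downtime per machine, 0.1% shared-cause.
print(all_down_probability(0.01, 1))         # one machine, independent faults only
print(all_down_probability(0.01, 2))         # two machines, independent faults only
print(all_down_probability(0.01, 2, 0.001))  # two machines plus a shared cause
```

Under these assumptions a second machine helps a lot against independent faults, but the shared-cause term puts a floor on the result that adding machines doesn't lower.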

The key thing is to really think about what the cost for being down is, how long is acceptable/desirable to be down, and how much you're willing to spend to hit those goals.
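As a back-of-the-envelope way to do that thinking, here's a small sketch. The function names and dollar figures are hypothetical, and it assumes downtime cost scales linearly with hours down, which is often not true in practice.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability: float) -> float:
    """Hours per year you can be down and still hit an availability target."""
    return (1 - availability) * HOURS_PER_YEAR


def redundancy_pays_off(cost_per_hour_down: float,
                        hours_avoided_per_year: float,
                        yearly_redundancy_cost: float) -> bool:
    """True if the downtime you expect to avoid costs more than the redundancy."""
    return cost_per_hour_down * hours_avoided_per_year > yearly_redundancy_cost


for target in (0.99, 0.999, 0.9999):
    print(target, round(allowed_downtime_hours(target), 2), "hours/year")

# Hypothetical: $2,000/hour of downtime, 6 hours avoided, $50,000/year for a second site.
print(redundancy_pays_off(2_000, 6, 50_000))
```

If the numbers come out close, the answer probably depends more on things this leaves out (reputation, contractual penalties, how much engineering time failover eats).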

> In an ideal world, your two datacenters would be managed by different companies, so you would avoid any one company's global routing failure

I can't understand this. I think transferring servers would be the least of the problems. It's the transferring of the database and maintaining a consistent version of the databases in both locations. Moving snapshots every X minutes doesn't maintain consistency. I would like to read about any company that is able to do this, as honestly it sounds really hard to me. Is there any writeup of the IBM thing you mentioned?

Re: IBM outage

https://news.ycombinator.com/item?id=23471698

TL;DR: connectivity to and from the IBM Cloud datacenters (which includes SoftLayer) was generally unavailable, globally, for a couple of hours. If you were in multiple IBM datacenters, you were as down as if you were in only one (mostly; I was poking around when it was wrapping up, and some datacenters came back earlier than others).

> It's the transferring of the database and maintaining a consistent version of the databases in both locations. Moving snapshots every X minutes doesn't maintain consistency. I would like to read about any company that is able to do this, as honestly it sounds really hard to me

The gold standard here is two-phase commit. Of course, that subjects every transaction to delay, so people tend not to do that. The close-enough version is MySQL (or other DB) replication: monitor that the replication stream is pretty current, and hope not a lot is lost when a datacenter dies. There's room to fiddle with failover and reconciliation. I recommend against automatic failover for writes, because it gets really messy if you get a split-brain situation: some of your hosts see one write server available and others see another, and you may accept conflicting writes. A few minutes running like that can mean days or weeks of reconciliation if you didn't build for reconciliation.
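For anyone who hasn't seen two-phase commit, here's a toy in-memory sketch of the idea. The `Participant` class and `two_phase_commit` function are hypothetical names for illustration, not any database's API. A real implementation also needs durable logs, timeouts, and coordinator recovery, none of which are shown here.

```python
class Participant:
    """Stands in for the database in one datacenter."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a healthy or failing site
        self.data = {}
        self.staged = None

    def prepare(self, key, value):
        # Phase 1: stage the write and vote yes, or vote no.
        if not self.can_commit:
            return False
        self.staged = (key, value)
        return True

    def commit(self):
        # Phase 2 (everyone voted yes): make the staged write visible.
        key, value = self.staged
        self.data[key] = value
        self.staged = None

    def abort(self):
        # Phase 2 (someone voted no): throw away the staged write.
        self.staged = None


def two_phase_commit(participants, key, value):
    """Apply the write everywhere or nowhere; returns True if committed."""
    votes = [p.prepare(key, value) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


dc_a, dc_b = Participant("dc-a"), Participant("dc-b")
print(two_phase_commit([dc_a, dc_b], "user:1", "alice"))  # both sites healthy

dc_b.can_commit = False
print(two_phase_commit([dc_a, dc_b], "user:2", "bob"))    # one site votes no
print(dc_a.data, dc_b.data)
```

Every write waits on a round trip to every participant before it commits, which is where the per-transaction delay comes from.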