HN Mirror


by toast0 2146 days ago

Re: single datacenter. At the basic level, you need a second datacenter with enough machines to provide your service (or at least an emergency version of it), replication of data, and a way to switch traffic. It's doable, but expensive in capital and development. If you're dependent on outsourced services, they also need to be available from both datacenters and not served from only one. In an ideal world, your two datacenters would be managed by different companies, so you would avoid any one company's global routing failure (IBM had one recently).

Re: multiple servers. Power supplies fail, memory modules fail, CPUs fail, fans fail, storage drives fail. Sometimes those failures are correlated: the HP SSDs that failed when their power-on hours hit a limit (two separate models) are going to be pretty correlated if they were purchased new, put into servers at a similar time, and then run 24/7. Most of those failures aren't that correlated, though. Software failures are more likely to be correlated, of course.

The key thing is to really think about what being down costs you, how long it's acceptable to be down, and how much you're willing to spend to hit those goals.
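To put rough numbers on that tradeoff, here is a minimal sketch. The function names and every figure passed in are hypothetical, chosen only for illustration. It assumes you can estimate a cost per hour of downtime and the expected outage hours per year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability: float) -> float:
    """Hours per year you can be down and still meet an availability target.

    `availability` is a fraction, e.g. 0.999 for "three nines".
    """
    return HOURS_PER_YEAR * (1.0 - availability)


def redundancy_pays_off(expected_outage_hours: float,
                        cost_per_hour_down: float,
                        yearly_cost_of_redundancy: float) -> bool:
    """Crude break-even test: is the expected yearly outage loss larger
    than what the second datacenter (or second server) costs per year?

    All three inputs are estimates you have to supply yourself.
    """
    expected_loss = expected_outage_hours * cost_per_hour_down
    return expected_loss > yearly_cost_of_redundancy


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(f"{allowed_downtime_hours(0.999):.2f} hours/year at 99.9%")
    print(redundancy_pays_off(10, 5_000, 40_000))
```

This ignores harder-to-price costs such as reputation and engineering time, so treat the result as a starting point for the conversation rather than an answer.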


YetAnotherNick 2146 days ago

> In an ideal world, your two datacenters would be managed by different companies, so you would avoid any one company's global routing failure

I can't understand this. I think transferring the servers would be the least of the problems. It's the transferring of the database and maintaining a consistent version of the databases in both locations. Moving snapshots every X minutes doesn't maintain consistency. I would like to read about any company that is able to do this, as honestly it sounds really hard to me. Is there any writeup of the IBM outage you mentioned?

toast0 2146 days ago

Re: IBM outage

https://news.ycombinator.com/item?id=23471698

TLDR: connectivity to and from the IBM cloud datacenters (which includes SoftLayer) was generally unavailable, globally, for a couple of hours. If you were in multiple IBM datacenters, you were as down as if you were in only one (mostly; I was poking around when it was wrapping up, and some datacenters came back earlier than others).

> It's the transferring of the database and maintaining a consistent version of the databases in both locations. Moving snapshots every X minutes doesn't maintain consistency. I would like to read about any company that is able to do this, as honestly it sounds really hard to me

The gold standard here is two-phase commit. Of course, that subjects every transaction to delay, so people tend not to do that. The close-enough version is MySQL (or other DB) replication: monitor that the replication stream is pretty current, and hope not a lot is lost when a datacenter dies. There's room to fiddle with failover and reconciliation. I recommend against automatic failover for writes, because it gets really messy if you end up in a split-brain situation, where some of your hosts see one write server available and others see another, and you may accept conflicting writes. A few minutes running like that can mean days or weeks of reconciliation, if you didn't build for reconciliation.
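A minimal sketch of the "monitor the replication stream, but keep write failover manual" idea follows. It assumes you already collect a lag figure in seconds from your replica (for MySQL, something like `Seconds_Behind_Source` from `SHOW REPLICA STATUS`, which is NULL when replication isn't running). The names `check_replica` and `may_promote` and the 30-second threshold are hypothetical, not anything from a real tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplicaCheck:
    healthy: bool
    reason: str


def check_replica(lag_seconds: Optional[float],
                  max_lag_seconds: float = 30.0) -> ReplicaCheck:
    """Classify a replica from its reported lag.

    `lag_seconds` is None when the replica reports no lag value,
    which we treat as "replication is not running".
    """
    if lag_seconds is None:
        return ReplicaCheck(False, "replication not running")
    if lag_seconds > max_lag_seconds:
        return ReplicaCheck(
            False,
            f"replica is {lag_seconds:.0f}s behind (limit {max_lag_seconds:.0f}s)",
        )
    return ReplicaCheck(True, "replica is current enough")


def may_promote(old_primary_fenced: bool, operator_confirmed: bool) -> bool:
    """Gate for switching writes to the replica.

    Deliberately has no automatic path: a human must confirm, and the old
    primary must be known to be cut off from writes, so two hosts never
    accept writes at the same time (the split-brain case).
    """
    return old_primary_fenced and operator_confirmed


if __name__ == "__main__":
    print(check_replica(5))
    print(check_replica(120))
    print(may_promote(old_primary_fenced=True, operator_confirmed=False))
```

The lag check is what you alert on; the promotion gate is intentionally boring, since the expensive failure is two write servers running at once rather than a few extra minutes of downtime.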