by tinco 4978 days ago
Of course they have redundancy, just not cross-datacenter redundancy. And if you knew anything about cross-datacenter redundancy, you'd know that it is not something you decide upon lightly.

Then again, having cross-datacenter backups that can easily be taken online would be a bit more professional than 'we want to physically move the servers'.


I'll be the first to admit I don't really know anything about cross-datacenter redundancy; however, I always thought that was pretty high on the list once you had SaaS products that were pulling in enough revenue to warrant full-time employees outside of the founders. What are the reasons why you would choose not to do it? Are they all financial or are there other implications?
I think the biggest argument against complex cross-DC redundancy is that it can add complexity and failure modes, not just during the emergency, but every day.

As a simple example, I've seen at least a half dozen people who had issues because they thought it was as simple as throwing a MySQL node into each datacenter, only to discover (much later) that the databases had become inconsistent and that failing over created bigger problems than it solved.
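One way to catch that kind of silent drift early is to compare per-table checksums between the two nodes on a schedule. The following is only a minimal sketch: the node contents are plain in-memory dicts standing in for query results, and `table_checksum` and `diverged_tables` are hypothetical helper names, not part of MySQL or any tool mentioned in this thread.

```python
import hashlib


def table_checksum(rows):
    """Return an order-independent checksum for an iterable of row tuples."""
    # Hash each row, then sort the digests so that row order does not matter.
    row_digests = sorted(
        hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in rows
    )
    return hashlib.sha256("".join(row_digests).encode("ascii")).hexdigest()


def diverged_tables(primary, replica):
    """Return the sorted names of tables whose contents differ between nodes.

    `primary` and `replica` map table name -> list of row tuples.
    A table missing on one side is treated like an empty table.
    """
    names = set(primary) | set(replica)
    return sorted(
        name
        for name in names
        if table_checksum(primary.get(name, []))
        != table_checksum(replica.get(name, []))
    )


# Hypothetical data: same users in a different order, one order row drifted.
primary = {"users": [(1, "ann"), (2, "bob")], "orders": [(10, 1, "paid")]}
replica = {"users": [(2, "bob"), (1, "ann")], "orders": [(10, 1, "pending")]}

print(diverged_tables(primary, replica))  # prints ['orders']
```

On a real database you would feed it rows from each node (or use a purpose-built checksum tool); the point is only that inconsistency has to be looked for, since replication alone will not report it.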

Similarly, I've seen complex high-availability infrastructures where the complexity of that infrastructure created more net downtime than a simpler infrastructure would have; it just went down at slightly different times.

And you really need to think about the implications of various failure modes. If you go down in the middle of a transaction, is that a problem for your application? Is it okay to roll back to data that's 3 hours old? 3 minutes? 3 seconds?
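To put rough numbers on the "3 hours, 3 minutes, 3 seconds" question, here is a small sketch that estimates how many committed writes a failover would discard for a given replication lag. The write rate of 50 per second is a made-up assumption for illustration, and `writes_at_risk` is a hypothetical helper.

```python
def writes_at_risk(writes_per_second, replication_lag_seconds):
    """Estimate how many writes are lost if we fail over to a lagging copy."""
    return writes_per_second * replication_lag_seconds


ASSUMED_WRITES_PER_SECOND = 50  # hypothetical load, not a figure from the thread

for label, lag_seconds in [("3 seconds", 3), ("3 minutes", 180), ("3 hours", 10_800)]:
    lost = writes_at_risk(ASSUMED_WRITES_PER_SECOND, lag_seconds)
    print(f"{label} of lag: about {lost} writes lost")
```

Whether 150 lost writes or 540,000 lost writes is acceptable depends entirely on what those writes are, which is why the question has to be answered per application.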

There are any number of situations where it's reasonable to say "we expect our datacenter will fail once every couple decades and when it does, we'll be down for a couple days."
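As a back-of-the-envelope check on that trade-off, the sketch below converts "one failure every couple decades, down for a couple days" into an availability figure. It assumes 20 years and 2 days as stand-ins for "a couple decades" and "a couple days"; both numbers are illustrative readings, not data.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365.25


def availability(downtime_days, period_years):
    """Fraction of time the service is up over the whole period."""
    return 1.0 - downtime_days / (period_years * DAYS_PER_YEAR)


def downtime_hours_per_year(downtime_days, period_years):
    """Downtime averaged out per year, in hours."""
    return downtime_days * HOURS_PER_DAY / period_years


# Assumed reading of the comment: 2 days down once in 20 years.
print(f"{availability(2, 20):.4%}")      # prints 99.9726%
print(downtime_hours_per_year(2, 20))    # prints 2.4
```

For comparison, a 99.9% target allows roughly 8.8 hours of downtime per year, so on average this scenario stays inside it. The catch is that the downtime arrives as one multi-day outage rather than as a few hours spread over each year.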

Great explanation, thank you.
Are you kidding me? If you run big sites like FogBugz then of course you have cross-datacenter redundancy. It's not complicated to host your staging site in another physical location and point the DNS records to it when things go pear-shaped.
Yes, so this staging site of yours has exactly the same databases as your production site? Without customer data, FogBugz and Trello are useless. This means that this simple staging site of yours needs to have all data replicated to it, which means it also needs the same hardware provisioned for it, effectively doubling your physical costs and your maintenance costs while reducing the simplicity of your architecture. Of course, if you're big enough you can afford to do this, and one could argue Fog Creek is big enough. I'm just saying it's not a simple no-brainer.

What is a simple no-brainer, however, is to have offline offsite backups that can easily be brought online. A best practice is to have your deployment automated in such a way that deploying to a new datacenter that already has your data is a trivial thing.

But yeah, if you're running a tight ship, sometimes things like that go overboard without anyone noticing.

Remember the story of the 100% uptime banking software that ran for years without ever going down, always applying the patches at runtime? Then one day a patch finally came in that required a reboot, and it was discovered that in all the years of runtime patches without reboots, nobody had ever tested whether the machine could actually still boot, and of course it couldn't :)

Data should be backed up to staging nightly anyway. There should also be scripts in place to start this process at an arbitrary point in time and to import the data into the staging server. You do not need to match the hardware if you use cloud hosting since you can scale up whenever you want.

Here's where it gets really simple. Resize the staging instance to match live. Put live into maintenance mode and begin the data transfer to staging (with a lot of cloud providers, steps #1 and #2 can be done in parallel). As soon as it finishes copying, take live down, point the DNS records at staging and wait for a few minutes. Staging is now live, with all of live's data. Problem solved. Total downtime: hardly anything compared to not being prepared. Total data loss: none.
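The procedure above can be sketched as a scripted runbook. This is an illustration only: `RecordingOps` is a hypothetical stand-in that just records the requested actions, and every method name on it is made up here rather than taken from any real cloud provider's API. The data check before the DNS switch is an extra safeguard added in this sketch, not a step from the comment, and the sketch runs the steps one after another instead of in parallel.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingOps:
    """Hypothetical stand-in for provider calls; it only logs what was asked."""

    log: list = field(default_factory=list)
    data_matches: bool = True  # what the fake consistency check should report

    def resize_staging(self):
        self.log.append("resize_staging")

    def enable_maintenance_mode(self):
        self.log.append("enable_maintenance_mode")

    def copy_data_to_staging(self):
        self.log.append("copy_data_to_staging")

    def staging_matches_live(self):
        self.log.append("staging_matches_live")
        return self.data_matches

    def take_live_down(self):
        self.log.append("take_live_down")

    def point_dns_at_staging(self):
        self.log.append("point_dns_at_staging")


def fail_over(ops):
    """Run the failover steps in order, aborting before DNS if data is off."""
    ops.resize_staging()
    ops.enable_maintenance_mode()
    ops.copy_data_to_staging()
    # Added safeguard (an assumption of this sketch): verify before switching.
    if not ops.staging_matches_live():
        raise RuntimeError("staging data does not match live; DNS left unchanged")
    ops.take_live_down()
    ops.point_dns_at_staging()


ops = RecordingOps()
fail_over(ops)
print(ops.log)
```

Having the steps in a script that is exercised regularly is what keeps the procedure trustworthy; a runbook that has never been executed has the same problem as the banking machine that was never rebooted.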

I fully agree that this is how it could, and perhaps should, be done. But you assume they are already on cloud hosting, which they obviously aren't. Of course, this is also a choice that has to be made consciously, especially since Fog Creek has been around a lot longer than the big cloud providers.

You can look to Amazon to see that cloud architecture brings with it hidden complexity that also increases the risk of downtime, while you relinquish a lot of control over, for example, the latency and bandwidth between your nodes.

What I don't know, by the way, is whether the total cost of ownership is larger for colocation or for cloud hosting.

Why do you think they aren't doing this?

Possible explanations

1) Their engineers never thought of it

2) They considered it, and it is as simple as you think... but they don't care about uptime.

3) Implementing geographic redundancy is harder than you think given whatever other constraints or environment they face.

4) Some other explanation

#3 seems like the most likely explanation to me.

So which of your big sites have cross-datacenter redundancy? Why don't you talk about the decision process that led to that and the costs associated?

Unless you're just talking out of your arse of course and you have no experience with that sort of thing at all.

The relationship between willingness to opine on a topic and knowledge of that topic:

http://www.smbc-comics.com/?id=2475