by Supreme 4973 days ago
Are you kidding me? If you run big sites like FogBugz then of course you have cross-datacenter redundancy. It's not complicated to host your staging site in another physical location and point the DNS records to it when things go pear-shaped.

Yes, so this staging site of yours has exactly the same databases as your production site? Without customer data, FogBugz and Trello are useless. This means that this simple staging site of yours needs to have all data replicated to it, which means it also needs the same hardware provisioned for it, effectively doubling your physical costs and your maintenance costs, and making your architecture less simple. Of course, if you're big enough you can afford to do this, and one could argue Fog Creek is big enough. I'm just saying it's not a simple no-brainer.

What is a simple no-brainer, however, is to have offline offsite backups that can easily be brought online. A best practice is to have your deployment automated in such a way that deploying to a new datacenter that already has your data is a trivial thing.

But yeah, if you're running a tight ship, sometimes things like that go overboard without anyone noticing.

Remember the story of the 100% uptime banking software that ran for years without ever going down, always applying the patches at runtime? Then one day a patch finally came in that required a reboot, and it was discovered that in all those years of runtime patches without reboots, nobody had ever tested whether the machine could actually still boot, and of course it couldn't :)

Data should be backed up to staging nightly anyway. There should also be scripts in place to start this process at an arbitrary point in time and to import the data into the staging server. You do not need to match the hardware if you use cloud hosting since you can scale up whenever you want.
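As a rough illustration of the "script that copies live's data into staging" idea, here is a minimal sketch. It uses SQLite from the Python standard library purely as a stand-in (an assumption; nothing here says what database these sites actually run on), and the function name, table and row are made up for the demo. The point is only that the copy is a single callable you can run nightly or at an arbitrary point in time.

```python
import sqlite3


def copy_live_to_staging(live: sqlite3.Connection, staging: sqlite3.Connection) -> None:
    """Copy the full contents of the live database into staging.

    Hypothetical helper: sqlite3's online backup API stands in for whatever
    dump-and-import tooling the real database would need.
    """
    live.backup(staging)


# Demo with in-memory databases standing in for the live and staging servers.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE cases (id INTEGER PRIMARY KEY, title TEXT)")
live.execute("INSERT INTO cases (title) VALUES ('Login page is down')")
live.commit()

staging = sqlite3.connect(":memory:")
copy_live_to_staging(live, staging)

rows = staging.execute("SELECT title FROM cases").fetchall()
print(rows)
```

Scheduling (cron or similar) and the transfer between datacenters are left out on purpose; they depend entirely on the setup.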

Here's where it gets really simple. Resize the staging instance to match live. Put live into maintenance mode and begin the data transfer to staging (with a lot of cloud providers, step #1 and #2 can be done in parallel). As soon as it finishes copying, take live down, point the DNS records at staging and wait for a few minutes. Staging is now live, with all of live's data. Problem solved. Total downtime: hardly anything compared to not being prepared. Total data loss: none.
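The steps above can be written down as a small runbook script. This is only a sketch of the ordering: every step is a hypothetical callable you would have to implement against your own provider and DNS host, and none of the names below come from a real API. It runs the steps one after another; running the resize in parallel with the transfer is left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, List

Step = Callable[[], None]


@dataclass
class FailoverSteps:
    """Hypothetical hooks, one per step of the procedure described above."""
    resize_staging: Step
    enable_maintenance_mode: Step
    copy_data_to_staging: Step
    take_live_down: Step
    point_dns_at_staging: Step


def fail_over(steps: FailoverSteps) -> List[str]:
    """Run the failover steps in order and return the names of completed steps.

    An exception in any step stops the run, so DNS is never repointed
    unless the data copy has finished.
    """
    ordered = [
        ("resize_staging", steps.resize_staging),
        ("enable_maintenance_mode", steps.enable_maintenance_mode),
        ("copy_data_to_staging", steps.copy_data_to_staging),
        ("take_live_down", steps.take_live_down),
        ("point_dns_at_staging", steps.point_dns_at_staging),
    ]
    completed: List[str] = []
    for name, step in ordered:
        step()
        completed.append(name)
    return completed


# Dry run with stub steps that only record that they were called.
calls: List[str] = []
dry_run = FailoverSteps(
    resize_staging=lambda: calls.append("resize"),
    enable_maintenance_mode=lambda: calls.append("maintenance"),
    copy_data_to_staging=lambda: calls.append("copy"),
    take_live_down=lambda: calls.append("down"),
    point_dns_at_staging=lambda: calls.append("dns"),
)
completed = fail_over(dry_run)
print(completed)
```

The "wait for a few minutes" part is not modelled; how long DNS takes to settle depends on the TTLs on your records.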

I fully agree that this is how it could, and perhaps should, be done. But you assume they are already on cloud hosting, which they obviously aren't. Of course, this is also a choice that has to be made consciously, especially since Fog Creek has been around a lot longer than the big cloud providers.

You can look to Amazon to see that cloud architecture brings with it hidden complexity that also increases the risk of downtime, while you relinquish a lot of control over, for example, the latency and bandwidth between your nodes.

What I don't know, by the way, is whether the total cost of ownership is larger for colocation or for cloud hosting.

Why do you think they aren't doing this?

Possible explanations

1) Their engineers never thought of it

2) They considered it, and it is as simple as you think... but they don't care about uptime.

3) Implementing geographic redundancy is harder than you think given whatever other constraints or environment they face.

4) Some other explanation

#3 seems like the most likely explanation to me.

So which of your big sites have cross-datacenter redundancy? Why don't you talk about the decision process that led to that and the costs associated?

Unless you're just talking out of your arse of course and you have no experience with that sort of thing at all.

The relationship between willingness to opine on a topic and knowledge of that topic:

http://www.smbc-comics.com/?id=2475