by gemenon 5493 days ago
Having a DR site and never testing that failover works is the big issue I see here. You can check your theory all day long, but if you never actually fail over to confirm it works, you might as well not have a DR site. It's the same as taking backups without ever verifying that you can restore what you need after a loss. While I'm surprised that testing DR capability wasn't listed as something to work on, this sort of open write-up is very valuable to both customers and the rest of us as engineers.
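
To make the backup analogy concrete, here is a minimal sketch of a restore check in Python. It assumes the backup is a plain file copy and uses a hypothetical helper named `verify_restore`; a real restore step (database import, snapshot mount, and so on) would replace the `shutil.copy2` call.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original: Path, backup: Path) -> bool:
    """Restore `backup` into a scratch directory and compare it to `original`.

    Hypothetical helper: the copy below stands in for whatever the real
    restore procedure is.
    """
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / original.name
        shutil.copy2(backup, restored)  # assumption: restore == file copy
        return sha256_of(restored) == sha256_of(original)
```

The point is that the check exercises the restore path itself, not just the existence of the backup file.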
2 comments

I really liked Netflix's approach to testing failure scenarios by creating the Chaos Monkey[1] (Jeff Atwood had good insight as well[2]). The idea of building a mechanism to test failover scenarios is something I wouldn't have thought of before the transparency that companies like Grasshopper have shown. So hat tip to them for opening up about a large failure. Their honesty makes the rest of the dev community smarter and better.

[1]: http://techblog.netflix.com/2010/12/5-lessons-weve-learned-u...
[2]: http://www.codinghorror.com/blog/2011/04/working-with-the-ch...

One great way to ensure your DR site is tested is not to make it a special disaster site at all, but rather to run multiple active sites that each regularly serve production traffic. You still need to provision enough capacity to handle your full traffic even when an entire site is down.
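
The capacity point can be worked out with simple arithmetic. The sketch below is only an illustration, assuming traffic spreads evenly across the surviving sites; the function names and the example numbers are made up.

```python
def per_site_capacity(peak_load: float, sites: int, tolerated_failures: int = 1) -> float:
    """Capacity each site needs so the survivors can carry the full peak load.

    Assumes load is balanced evenly across whichever sites remain up.
    """
    surviving = sites - tolerated_failures
    if surviving < 1:
        raise ValueError("need at least one surviving site")
    return peak_load / surviving


def overprovision_ratio(sites: int, tolerated_failures: int = 1) -> float:
    """Total provisioned capacity divided by peak load."""
    surviving = sites - tolerated_failures
    if surviving < 1:
        raise ValueError("need at least one surviving site")
    return sites / surviving


# Example with made-up numbers: 9000 requests/s at peak across 3 active sites.
# Each site must be sized for 4500 requests/s, i.e. 1.5x total overprovisioning.
print(per_site_capacity(9000, 3), overprovision_ratio(3))
```

With two sites you pay for 2x your peak capacity; with four sites the overhead drops to about 1.33x, which is one reason more, smaller active sites can be cheaper than a single idle DR site.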