by chadaustin 3799 days ago
I've experienced a brief full-scale power loss at a data center before. It is unbelievable how much goes wrong. The machines had been chugging along for years, happily doing their job, but on the next boot the hard drives were suddenly corrupted, or the power supplies broken. The impacts of that power outage were felt for at least six months.

It's one of those things where, if you're not regularly cutting power to your data center, you're not building resilience to such a thing happening. So when it does, it's not pretty. :)


> if you're not regularly cutting power to your data center, you're not building resilience to such a thing happening

Would love to read examples of who is doing this and how. Reminds me of Netflix's Chaos Monkey, only applied to electricity. :p

There's a mention of Facebook regularly doing this in the summary section of this Instagram engineering post: http://engineering.instagram.com/posts/548723638608102/

EDIT: Here's more info: http://www.datacenterknowledge.com/archives/2014/09/15/faceb...

Awesome, thank you. :)

I remember reading a few years back that Yahoo takes a random data center offline once a week, just to make sure they can do so without issues. They probably didn't actually cut the power ;) But they used it as an argument against investing too much in emergency generators and such: those will fail or cause accidents, and you need the ability to fail over either way, so make it routine.
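
For illustration only, here is a rough sketch of what such a routine drill could look like. Everything in it is an assumption: the site names and the `take_offline`, `restore`, and `service_healthy` callbacks are hypothetical stand-ins for whatever tooling an operator actually has, and this is not a description of Yahoo's process.

```python
import random


def run_failover_drill(data_centers, take_offline, restore, service_healthy, rng=None):
    """Take one randomly chosen data center offline and report whether
    the service stayed healthy while it was out."""
    rng = rng or random.Random()
    target = rng.choice(list(data_centers))
    take_offline(target)
    try:
        healthy = service_healthy()
    finally:
        # Always bring the site back, even if the health check raises.
        restore(target)
    return target, healthy


# Stubbed usage: "offline" just records which site is currently drained.
offline = set()
target, healthy = run_failover_drill(
    ["dc-east", "dc-west", "dc-central"],  # hypothetical site names
    take_offline=offline.add,
    restore=offline.discard,
    service_healthy=lambda: len(offline) == 1,  # placeholder health check
)
print(target, healthy)
```

The only real design point is the `try`/`finally`: a drill that can leave a site down after a failed check is worse than no drill.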
I think trying to cut power at least once is better if it's possible. The reason is that digital is just an abstraction over analog, electrical activity. Plus there's actual analog in there doing work, too. So, seeing how all the chips in there respond to an actual and instantaneous drop of the power would be an interesting test of the models they're built against.

As a commenter above mentioned, odd behavior in the electrical system can make some products go haywire and even corrupt data in unexpected ways. Of course, simulated takedowns and appropriate countermeasures for common issues should already have happened before a real one, just to be extra clear.

Google wrote an article about disaster recovery in 2012. https://queue.acm.org/detail.cfm?id=2371516

What data center was it?

I can't remember the last time there was a power outage at a Tier I or II data center -- they're all N+1, from the cabinet PDUs to the distribution units to the UPSes to the diesel generators. Some even go so far as to connect to multiple in-feeds from different utility providers.

At my company, every piece of server, storage, and network equipment we own is connected via redundant power supplies to different circuits (the exception is nonessential equipment like monitors, which we can simply re-plug into the functioning circuit). I can't imagine running a data center any other way.
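
A minimal sketch of auditing that rule, assuming a hypothetical inventory that maps each device to the circuits its power supplies are plugged into (the device names, circuit labels, and the `exempt` set are made up for illustration):

```python
def single_circuit_devices(inventory, exempt=()):
    """Return the non-exempt devices whose power supplies do not span
    at least two distinct circuits."""
    return sorted(
        name
        for name, circuits in inventory.items()
        if name not in exempt and len(set(circuits)) < 2
    )


# Hypothetical inventory: device name -> circuit of each power supply.
inventory = {
    "db-01": ["A", "B"],      # redundant: two different circuits
    "switch-03": ["A", "A"],  # two supplies, but both on circuit A
    "monitor-07": ["B"],      # nonessential, so exempted below
}

print(single_circuit_devices(inventory, exempt={"monitor-07"}))
# prints ['switch-03']
```

Counting distinct circuits rather than power supplies is the point: two supplies on the same circuit give no protection when that circuit drops.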