by preview 6128 days ago
I wonder about the single point of failure posed by the power supplies. One failed box is not a big deal (since I assume the data is replicated over several). But, what if they get a bad batch of supplies and see a relatively high failure rate? I wonder how high a power supply failure rate they can handle.

The need to stagger the power-on of the two supplies poses a problem. What if power to a data center is lost? When power is restored, all the boxes will try to start at once, blowing all the fuses. Granted, this is a catastrophic event, so its frequency should be very low. But this also seems like an area that could be automated.

You should only use 75% of the rated capacity of a circuit, which means you have enough power to turn them all on at once.

Some of the more expensive managed power supplies also support a staggered power-on after a power failure. But I don't worry about it; only using 75% of the power circuit solves that problem for me.

Unfortunately, using 75% of the rated capacity may not be enough to handle the inrush. The article discusses this point, "...if you power up both PSUs at the same time, it can draw a large (14 amp) spike of 120V power from the socket." That would mean one pod per 20A circuit. Ugh. In normal operating conditions, a 5.6A max load would allow three pods per circuit.
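To make the arithmetic concrete, here is a rough sketch using only the figures quoted in this thread (20A circuit, 5.6A max running load, 14A power-on spike). The function name and the simple "divide and round down" model are my own assumptions. It counts the spike fully against the limit, as the comment above does, and ignores how long a real breaker tolerates a brief inrush.

```python
import math

def pods_per_circuit(circuit_amps, amps_per_pod, derate=1.0):
    """Whole pods that fit under the usable current (hypothetical helper)."""
    usable_amps = circuit_amps * derate
    return math.floor(usable_amps / amps_per_pod)

CIRCUIT_A = 20.0  # circuit rating mentioned in the thread
RUNNING_A = 5.6   # max running load per pod, from the thread
INRUSH_A = 14.0   # spike when both PSUs start together, from the article quote

# Steady state: full rating vs. the 75% rule of thumb
running_full = pods_per_circuit(CIRCUIT_A, RUNNING_A)
running_75 = pods_per_circuit(CIRCUIT_A, RUNNING_A, derate=0.75)

# Every pod powering on at the same moment
inrush_full = pods_per_circuit(CIRCUIT_A, INRUSH_A)
inrush_75 = pods_per_circuit(CIRCUIT_A, INRUSH_A, derate=0.75)

print(running_full, running_75, inrush_full, inrush_75)
```

Under these assumptions the running load allows three pods at the full 20A rating and two at 75% (15A), while a simultaneous power-on allows only one pod either way.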

Addressing this would require a little bit of design, but the problem is relatively simple. If they wanted to get fancy, they could add a chaining feature--pods on the same circuit would be connected together so that they'd power on serially. This would get away from their goal of using off-the-shelf parts. It is, as with many things, an engineering trade-off.

The PDUs that support 'staggered power-on' are 'off the shelf'. If that is not an option (really, we're talking maybe $500 for each 20A circuit, retail), the next thing I'd do is set 'power on after power fail' to off, then have some remotely accessible way to trip the power button. (I'm working on a solution to that particular problem, but that's not 'off the shelf'. Yeah, everything is on PDUs I can trip remotely, but there are reasons why it is much better to ungracefully reboot with the 'reset' jumper than to ungracefully reboot by cutting off the power.)

From there it would be easy enough to have an automated process turn on servers one at a time.
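A minimal sketch of what that automated process might look like, assuming you already have some remote way to press each server's power button. `power_on` here is a hypothetical callback standing in for whatever mechanism you use (remote PDU outlet, wired power button, etc.), and the delay value is a guess rather than a measured settling time.

```python
import time

def staggered_power_on(hosts, power_on, delay_s=5.0, sleep=time.sleep):
    """Power on hosts one at a time, pausing between them.

    power_on: hypothetical callable that starts a single host.
    delay_s:  assumed gap to let one host's inrush settle before the next.
    sleep:    injectable so the sequence can be tested without waiting.
    """
    started = []
    for index, host in enumerate(hosts):
        if index > 0:
            sleep(delay_s)  # wait only between hosts, not before the first
        power_on(host)
        started.append(host)
    return started

if __name__ == "__main__":
    # Stand-in for a real power control call; it only prints.
    staggered_power_on(["pod1", "pod2", "pod3"],
                       power_on=lambda host: print("powering on", host),
                       delay_s=1.0)
```

With 'power on after power fail' set to off, nothing starts by itself when power returns, so a loop like this is the only thing drawing inrush current, one box at a time.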