by vitus 484 days ago
> To increase the number of machines under power constraints, data center operators usually cap power use per machine. However, this can cause motherboards to degrade more quickly.

Can anyone elaborate on this point? This is counter to my intuition (and in fact, what I saw upon a cursory search), which is that power capping should prolong the useful lifetime of various components.

The only search results I found that claimed otherwise were indicating that if you're running into thermal throttling, then higher operating temperatures can cause components (e.g. capacitors) to degrade faster. But that's expressly not the case in the article, which looked at various temperature sensors.

7 comments

At the time of our investigation, we found a few articles suggesting that power caps could cause hardware degradation, though I don't have the exact sources at hand. I see the child comment shared one example, and after some searching, I found a few more sources [1], [2].

That said, I'm not an electronics engineer, so my understanding might not be entirely accurate. It’s possible that the degradation was caused by power fluctuations rather than the power cap itself, or perhaps another factor was at play.

[1] https://electronics.stackexchange.com/questions/65837/can-el...
[2] https://superuser.com/questions/1202062/what-happens-when-ha...

The power used by a computer isn't limited by giving it less voltage/current than it should have - if it was, the CPU would crash almost immediately. It's done by reducing the CPU's clock rate until the power it naturally consumes is less than the power limit.
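To make that concrete, here is a minimal sketch of the idea: step the clock down until the modelled power fits under the cap. The power model (constant static power plus a term growing with the cube of frequency) and every number in it are made-up assumptions for illustration, not measurements from any real CPU or firmware.

```python
def power_watts(freq_ghz, static_w=40.0, dynamic_w_per_ghz3=4.0):
    """Toy power model (assumed): static power plus a dynamic term ~ f^3."""
    return static_w + dynamic_w_per_ghz3 * freq_ghz ** 3


def cap_frequency(limit_w, max_ghz=3.5, min_ghz=1.0, step_ghz=0.1):
    """Lower the clock in small steps until modelled power is under limit_w.

    Stops at min_ghz even if the limit is still exceeded; the supply voltage
    is never touched, only the frequency.
    """
    freq = max_ghz
    while freq > min_ghz and power_watts(freq) > limit_w:
        freq = round(freq - step_ghz, 1)  # round to avoid float drift
    return freq


if __name__ == "__main__":
    for limit in (250, 150, 100):
        f = cap_frequency(limit)
        print(f"limit {limit} W -> {f} GHz, about {power_watts(f):.1f} W")
```

A generous limit leaves the clock at its maximum; a tighter one just makes the machine slower.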
Power = volts * amps

Voltage is whatever the utility company supplies.

Amps are monitored per rack and the usual data centre response to going over an amp limit is that a fuse blows or the data centre asks you for more money!

The only way you can decrease power used by a server is by throttling the CPUs.

The normal way of throttling CPUs is via the OS which requires cooperation.

I speculate this is also possible via the lights-out baseboard management controller (which doesn't need the OS to be involved), but I'm pretty sure you'd see that in /sys if it was.

Yep, that's weird, I've always read that high power/temp can degrade electronics way faster. Can any EE shed some light here?
As an electronics engineer I have no idea what the author is talking about here and was about to post the same question.
Every rack in a data center has a power budget, which is actually constrained by how much heat the HVAC system can pull out of the DC, rather than how much power is available. Nevertheless it is limited per rack to ensure a few high power servers don't bring down a larger portion of the DC.

I don't know for sure how the limiting is done, but a circuit breaker like the ones we have in our houses would be a simple solution. That causes the rack to lose power when the breaker trips, which is not ideal because you lose the whole rack and affect multiple customers.

Another option would be a current/power limiter [0], which would cause more problems because P = U * I. That would make the voltage (U) drop and leave the whole system undervolted. Weird glitches happen in that state, and undervolting is a common way to bypass various security measures in chips. For example, Raspberry Pi ran this challenge [1] to look for these kinds of bugs and test how well their chips can handle attacks, including voltage attacks.

[0] - https://en.m.wikipedia.org/wiki/Current_limiting
[1] - https://www.raspberrypi.com/news/security-through-transparen...

Computers implement power limits by reducing their own speed until their power consumption falls under the limit. There's no risk of damage and it should actually extend the lifetime due to less heat, as well as increasing the efficiency (computation per watt).

No idea what the article is talking about with the damage. Computers like to run slow when possible. There's basically no downside except they take longer to do things.

One possibility is that at lower power settings, the CPUs don't get as hot, which means the fans don't spin up as much, which can mean that other components also get less airflow and then get hotter than they would otherwise. The fix for this is usually to monitor the temperature of those other components and include that as an input to the fan speed algorithm. No idea if that's what's actually going on here though.
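A rough sketch of what "include that as an input to the fan speed algorithm" could look like: compute a fan demand per sensor and let the most demanding one win. The sensor names, limits, and the linear ramp are hypothetical choices for illustration, not how any particular BMC does it.

```python
def fan_duty_percent(temps_c, limits_c, ramp_c=30.0, min_duty=20.0, max_duty=100.0):
    """Return a fan duty cycle driven by the hottest sensor relative to its limit.

    temps_c and limits_c map sensor name -> degrees C (names are hypothetical).
    Each sensor ramps linearly from min_duty at (limit - ramp_c) to max_duty
    at its limit; the final duty is the maximum over all sensors.
    """
    duty = min_duty
    for name, temp in temps_c.items():
        start = limits_c[name] - ramp_c
        fraction = min(max((temp - start) / ramp_c, 0.0), 1.0)
        duty = max(duty, min_duty + fraction * (max_duty - min_duty))
    return duty


if __name__ == "__main__":
    limits = {"cpu": 90.0, "vrm": 100.0}
    # A power-capped CPU runs cool, but the VRM sensor still asks for airflow.
    print(fan_duty_percent({"cpu": 50.0}, limits))
    print(fan_duty_percent({"cpu": 50.0, "vrm": 95.0}, limits))
```

With only the CPU sensor as input the fans sit at minimum; adding the second sensor raises the duty even though the CPU is cool.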
Expert in server power management here. Your intuition is right and the comments/links to the contrary are wrong. Undervolting is unreliable but let's be clear: no one is undervolting servers. I don't even know if it's possible. Power limiting (e.g. RAPL) is completely safe to use because it keeps voltage, frequency, temperature, fan speed, etc within safe bounds.
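If you want to see what limit is currently set, Linux exposes RAPL through the powercap sysfs interface. A minimal sketch, assuming an Intel machine where the first package zone shows up at `/sys/class/powercap/intel-rapl:0` (the zone path can differ per system, and reading may need elevated permissions):

```python
from pathlib import Path


def read_rapl_limit_watts(zone_dir="/sys/class/powercap/intel-rapl:0"):
    """Return the first power constraint of a RAPL zone in watts, or None.

    The default zone path is an assumption about the machine; the kernel
    reports the value in microwatts in constraint_0_power_limit_uw.
    """
    limit_file = Path(zone_dir) / "constraint_0_power_limit_uw"
    try:
        return int(limit_file.read_text().strip()) / 1_000_000
    except (OSError, ValueError):
        # Missing interface, no permission, or unexpected file contents.
        return None


if __name__ == "__main__":
    limit = read_rapl_limit_watts()
    print("RAPL not available" if limit is None else f"package limit: {limit} W")
```

This only reads the limit; it doesn't change anything.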
The only place I could find an answer that sheds some light was Electronics Stack Exchange:

https://electronics.stackexchange.com/a/65827

> A mosfet needs a certain voltage at its gate to turn fully on. 8V is a typical value. A simple driver circuit could get this voltage directly from the power that also feeds the motor. When this voltage is too low to turn the mosfet fully on a dangerous situation (from the point of view of the mosfet) can arise: when it is half-on, both the current through it and the voltage across it can be substantial, resulting in a dissipation that can kill it. Death by undervoltage.