by MaulingMonkey 1301 days ago
> It seems inherently dangerous to me

In the old days, it kinda was - at least to your hardware.

Then people realized that blowing up components because a fan failed, or became unplugged, or a filter clogged with dust, maybe wasn't a great user experience and/or caused more in-warranty returns that required replacing hardware ($$expensive$$!), and implemented thermal throttling and thermal cutoffs. Nearly two decades ago at this point, I helped a friend diagnose his computer randomly turning off. It turned out the CPU fan had come unplugged, and the resulting overheating triggered a thermal cutoff. No other apparent harm done.

Fans aren't the only means of limiting heat: slowing stuff down and turning stuff off also works. And it turns out users sometimes would rather stuff run slow than run loud, and maybe your crappy motherboard vendor shouldn't be writing a ton of code running in kernel space - with all the potential stability and security issues that might entail - for whatever network-connected bloatware syncs your RGB lighting and fan settings to the cloud. And they will do exactly that, if that's what's required to give their customers what they want.

Just exposing fan RPMs to userspace might be far less dangerous.


> and maybe your crappy motherboard vendor shouldn't be writing a ton of code running in kernel space - with all the potential stability and security issues that might entail

This wasn't the suggestion I was making. I was suggesting that the motherboard, itself, should be controlling the fan RPMs (or should at least provide such a mode). I don't feel like taking a temperature input and mapping that to an RPM output should take much circuitry at all, but if that (somehow) required a full-blown CPU, I was thinking a (very small) auxiliary chip, dedicated to the task.

But yes, if you're going to do it on the main CPU, then in userspace. But now you incur all the problems I mentioned in the original comment, some of which can exhibit death spirals: CPU has to throttle due to heat, meaning less CPU time, meaning it will take longer to get to the code responsible for alleviating the problem of heat by notching the RPM up!

In the worst case, you hit the CPU's critical trip point before the problem can be brought under control.

On typical desktop motherboards, the Super IO chip handles all the temperature monitoring and fan control. Those chips usually have a few modes to configure some very simple control system for mapping temperature inputs to fan speed outputs (never anything as advanced as a PID controller).
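To make "a very simple control system" concrete, here is a minimal sketch of a piecewise-linear fan curve in Python. It is not the algorithm of any particular Super IO chip; the function name `fan_duty`, the 0-255 duty range, and the example curve points are all assumptions chosen for illustration.

```python
# Hypothetical fan curve: (temperature in deg C, PWM duty 0-255) points.
# These values are made up for illustration, not taken from any chip.
EXAMPLE_CURVE = [(40, 64), (60, 128), (80, 255)]


def fan_duty(temp_c, curve):
    """Map a temperature to a PWM duty by linear interpolation.

    `curve` is a non-empty list of (temp_c, duty) points. Below the
    first point the duty is clamped to the first duty; above the last
    point it is clamped to the last duty.
    """
    points = sorted(curve)
    if temp_c <= points[0][0]:
        return points[0][1]
    # Walk adjacent point pairs and interpolate inside the first
    # segment whose upper temperature is at or above temp_c.
    for (t0, d0), (t1, d1) in zip(points, points[1:]):
        if temp_c <= t1:
            return round(d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0))
    # Hotter than the last point: clamp to the maximum configured duty.
    return points[-1][1]
```

With the example curve, `fan_duty(50, EXAMPLE_CURVE)` lands halfway between the first two points. There is no integral or derivative term here, which is the difference from a PID controller mentioned above.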

The main problem is that the Super IO only has access to the temperature sensors on the motherboard itself, and on the CPU (these days, through PECI). There's no standard way for the Super IO to do out of band monitoring of temperatures on your GPU or storage drives, so if you want those to affect fan speeds you need to implement it in software.

Servers typically have BMCs controlling fans, and even Apple's x86 machines have their SMC; in both cases you typically see a more thorough monitoring of component temperatures, configured out of the box with a proper awareness of which fans are blowing across which components. But that stuff doesn't trickle down to the build-your-own desktop market.

Hmm. I guess I have no idea how this chip interfaces with the kernel then? The only knobs I've ever seen documented are direct RPM controls for manual control.

I know it's possible (despite the insistence to the contrary on the other thread), as basically every laptop does it. Only desktops seem to struggle with this concept.

How fan control is configured depends on the SuperIO chip, so it's different for e.g. Fintek SuperIOs than for Winbond/Nuvoton SuperIOs.

On Linux, a supported SuperIO will be exposed as a directory under /sys/class/hwmon. On one of my systems, the SuperIO is a Nuvoton NCT6791, so the relevant driver documentation is https://www.kernel.org/doc/Documentation/hwmon/nct6775

Relevant sysfs files to note are pwm[1-7]_mode to toggle between DC voltage and PWM control, and pwm[1-7]_enable to switch between full speed, pure software speed control, and several Nuvoton-specific automatic speed control modes.
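As a rough illustration of poking at those files, here is a small Python sketch that locates a hwmon directory by its `name` attribute and reads one PWM channel's state. The helper names `find_hwmon` and `read_pwm_state` are made up for this example, and the chip name "nct6791" in the usage note is an assumption about what the driver reports on that system; check the `name` file on your own machine.

```python
from pathlib import Path


def find_hwmon(chip_name, root="/sys/class/hwmon"):
    """Return the hwmon directory whose `name` file matches, or None."""
    for candidate in sorted(Path(root).glob("hwmon*")):
        name_file = candidate / "name"
        if name_file.is_file() and name_file.read_text().strip() == chip_name:
            return candidate
    return None


def read_pwm_state(chip_dir, channel):
    """Read duty, enable and mode for one PWM channel as integers.

    Assumes the chip exposes pwmN, pwmN_enable and pwmN_mode, as the
    comment above describes; not every hwmon driver provides all three.
    """
    chip_dir = Path(chip_dir)

    def read_int(attribute):
        return int((chip_dir / attribute).read_text().strip())

    return {
        "duty": read_int(f"pwm{channel}"),           # 0-255
        "enable": read_int(f"pwm{channel}_enable"),  # control mode selector
        "mode": read_int(f"pwm{channel}_mode"),      # DC vs PWM output
    }
```

Something like `read_pwm_state(find_hwmon("nct6791"), 1)` would then report how the first fan header is currently configured. Reading is generally unprivileged; changing the values means writing to the same files as root, and the meaning of each `pwm[1-7]_enable` value is chip-specific, so consult the driver documentation before writing anything.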

> I was thinking a (very small) auxiliary chip, dedicated to the task.

Extra cost (both in design and manufacture), new potential point of failure (both for manufacture and in the field), when the CPU failing was already a single point of failure for the machine. It's not that a full-blown CPU is required, it's that you already have a full-blown CPU that can do the job, simplifying the design. Well, you might see dedicated fan control hardware for server motherboards and other more industrial focused applications, but they often need to coordinate pumps/fans for an entire building - reliability in this context is more about redundancy, and alerting maintenance to the need for repairs, or perhaps switching to a new primary datacenter if the failure is big enough - not in attempting the impossible of ensuring 100% reliability for any individual component.

> CPU death spirals

Have you actually seen one of these that noticeably bogs down the fan-driving firmware/drivers and causes issues? I haven't. I've had fan failures. I've had plenty of hardware controlled fans go full apeshit 100% power to the point of being not merely a nuisance, but a problem (audio, vibration, wear+tear, ...). I've heard of building cooling failures. But I don't think I've seen so much as a blog post about the CPU getting so starved that it can't spin up the CPU fans.

And I've had fans not working hard enough - but I'd rather flip a setting in software than open up a case and go hunting for the right jumper, typically. Less disruptive - and the machine is typically usable enough I can still download/install missing software, and google appropriate documentation, which is frequently a lot more difficult to do with the case open.

---

I guess my main point here is that reliable hardware must already assume the potential for cooling failures, and that extra hardware or engineering for a minor improvement to a "purely theoretical" failure mode doesn't sound like it'd pay for itself.

100% have hardware temperature throttles and cutoffs though. Those cut in for a lot of very real failure modes that I've not only heard of actually happening, but personally experienced. Those will pay for themselves.

> Extra cost (both in design and manufacture), new potential point of failure (both for manufacture and in the field)

"But the BOM" is a tired trope in any discussion about "why can't it do X?". My money would be where my mouth is. But HW manufacturers routinely make design decisions that are simple negatives.

Motherboard manufacturers, in particular. Mobo marketing is focused on colors and flashy effects instead of on quality of production and actual function, beyond a bare minimum of tech specification. AFAICT the products are indistinguishable as no attempt is made to stand out.

> when the CPU failing was already a single point of failure for the machine

… and so is the RAM, the disk, the GPU, the northbridge, the southbridge, the eastbridge…

That's not a valid rationale for "let's adopt a fundamentally flawed design".

> Have you actually seen one of these that noticeably bogs down the fan-driving firmware/drivers and causes issues?

"Fan-driving firmware" is essentially the suggestion. It's what I haven't got.

> I've had fan failures. I've had plenty of hardware controlled fans go full apeshit 100% power to the point of being not merely a nuisance, but a problem (audio, vibration, wear+tear, ...).

No; in my current rig's 10-year lifespan, the fans have been run continuously at 100%, and no fan failures.

The noise, of course, is a nuisance. That's why I'm looking for something that can control the fans in response to temperature, such that the fans can be driven at a temperature-appropriate RPM.

But without doing that in userspace, where the controller might just "die" or effectively die, for any number of reasons, outlined earlier.
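To illustrate the concern, here is a hedged Python sketch of the most a userspace controller can do on its own: take manual control of a `pwmN_enable`-style file and put the previous value back when it exits, including on an unhandled exception. The name `manual_fan_control` and the choice of 1 as the "manual" value are assumptions for illustration.

```python
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def manual_fan_control(enable_path, manual_value=1):
    """Switch a pwmN_enable-style file to manual, restoring it on exit.

    `manual_value` is an assumption; check the driver documentation
    for the value your chip uses for software control.
    """
    path = Path(enable_path)
    previous = path.read_text().strip()
    path.write_text(f"{manual_value}\n")
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, but NOT if the process
        # is killed outright (e.g. SIGKILL) or hangs forever.
        path.write_text(previous + "\n")
```

The `finally` block covers a crash with a traceback, but not a hang, an OOM kill, or a starved scheduler. In those cases the fan stays at whatever duty was last written, which is exactly the failure mode being described.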

> But I don't think I've seen so much as a blog post about the CPU getting so starved that it can't spin up the CPU fans.

I couldn't find any posts reasoning about anything. Either it's not a problem, or just nobody is thinking.

> and that extra hardware or engineering for a minor improvement to a "purely theoretical" failure mode doesn't sound like it'd pay for itself.

Yeah, I was just trying to learn before attempting, literally, "IDK, try it and see if you trip the critical temp cutoff."

Yeah, no doubt the various hardware safeties will save you (although it might put some undue stress on stuff to be at >100℃…) but it might also leave you with a rather hard-to-debug situation if you start experiencing stalls or shutdowns.

… and it's only in desktops that this problem seems to exist. On every laptop I've owned, fan control is in response to temp. (And not handled in userspace, although I presume I could download any of the various fan control programs and have it be that way, but the point is that the HW is, out of the box, doing the sane thing.)

Yup, I got random thermal shutoffs. The pump in the water cooling loop had failed.