by deathanatos 1303 days ago
Also, one thing I've always wondered: why do people want to use hwmon to set fan RPMs? (Or really, why do this from userspace at all?) It seems inherently dangerous to me, as you're asking a process that might not receive CPU time, for whatever reason, from a very much non-realtime OS, to control a fan; if the current RPM is too low, and the system starts generating heat, but the fan-controlling process doesn't get CPU time … then what happens?

It seems to me you want fans controlled with something dedicated to it.

The other thing I don't get is all the plethora of options my motherboard gives me to set fans only to fixed RPMs. Am I crazy in that I want the fan to be controlled by heat? (More heat => more RPMs. Keep the system cool, but if there isn't much thermal load, spin the fans down and reduce the noise?)

But by fixing an RPM, it seems the only valid input is "100%"; anything else could be too low under stressful conditions.

I could also have a cheap motherboard. (I definitely won't be purchasing from this manufacturer again, and the motherboard does have other severe quality issues…)


> why do people want to use hwmon to set fan RPMs? (Or really, why do this from userspace at all?)

Some people want to have a mode switch; normal use should be silent/quiet, but when you know you're going to do something big (game, big compile, etc), fix the fans at full so the noise is consistent and cooling is best. (the cooler the chip, the more the boost)

Some people have no good options from the system firmware, and getting _something_ configurable is better. I've run on systems where I couldn't tell the system to actually run the fan, so things would get hot and throttle. Userspace configurability is better than nothing. This tends to be a bigger issue on things that are sold as a whole computer, like laptops, and small form factor things (which are often pretty much laptops without a battery and built-in user interface devices) but also some name-brand desktops.

My recent motherboards all seem to have a pretty nice fan configuration tool. Presets for quiet/performance/full speed, and a simple graph based UI to set % by temperature. Most of the fan headers can be set to follow the cpu temperature or the system temperature. When you buy the nice Noctua fans, they also ship 'low noise adapters' that I assume drop the voltage and limit the maximum RPMs and limit noise. Depending on your overall cooling design, that can be reasonable or asking for trouble.

> Some people want to have a mode switch; normal use should be silent/quiet, but when you know you're going to do something big (game, big compile, etc), fix the fans at full so the noise is consistent and cooling is best. (the cooler the chip, the more the boost)

Yeah, I don't doubt someone is like that … I'd just rather it be automatic.

> My recent motherboards all seem to have a pretty nice fan configuration tool.

Mine has a "flashy" tool, I would say. Certainly looks pretty, but again, it's all constant RPM options.

As I lament in the other thread, this is something that would differentiate boards at time of purchase, but no mobo manufacturer's marketing dept. seems to have its shit together enough to get such a differentiation across to the consumer. Instead the focus seems to be completely on the aesthetics of how the board looks.

And again, I've chalked this up to having chosen poorly. But therein is the problem: assuming I chose poorly, assuming some mobos do support sane defaults/features … how do I end up finding and purchasing one of those? Any knowledge I acquire during a purchase is useless the next time around, given the constant product churn HW manufacturers nonsensically do.

The two boards I've gotten recently advertise the features:

https://www.gigabyte.com/Motherboard/A520I-AC-rev-1 look for "Smart Fan 5", there's a tab you can click and see what the customization UI looks like (it's in the firmware settings usable with keyboard or mouse). ITX does mean this isn't a 'value' board, but when I got it the premium above mATX wasn't that much (and probably mostly went to the wireless I don't really need and barely ever use)

My other board is a bit more upmarket https://www.asrock.com/mb/AMD/B550M%20Pro4 it doesn't show anything on the marketing page, but in specifications it mentions "Smart Fan Speed Control" and the UI to configure it is pretty similar.

You get to set about 5 temp -> fan % settings and I can keep things cool without being noisy until I've got sustained load and then it's noisy and warm anyway. The ITX systems can only do so much with a cooler + heatsink height of 36mm (at 37mm the fan housing touches the mesh side panel), and the b550 currently has an anemic Wraith Stealth. Even with 65w target chips, that's not enough to keep them below 90C at high load.
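
For illustration, a temp -> fan % table like that can be sketched as a piecewise-linear curve. This is a minimal sketch in Python, not what either firmware actually does: the five points in `CURVE` and the `fan_percent` name are made up for the example, and real firmware may step between points rather than interpolate.

```python
from bisect import bisect_right

# Hypothetical curve: (temperature in degrees C, fan duty in %), sorted by temperature.
CURVE = [(30, 20), (45, 30), (60, 50), (75, 80), (85, 100)]


def fan_percent(temp_c, curve=CURVE):
    """Map a temperature to a fan duty by linear interpolation between curve points."""
    temps = [t for t, _ in curve]
    # Below the first point or above the last, hold the end values.
    if temp_c <= temps[0]:
        return float(curve[0][1])
    if temp_c >= temps[-1]:
        return float(curve[-1][1])
    # Find the segment containing temp_c and interpolate within it.
    i = bisect_right(temps, temp_c)
    (t0, p0), (t1, p1) = curve[i - 1], curve[i]
    return p0 + (p1 - p0) * (temp_c - t0) / (t1 - t0)
```

With these made-up points, 52.5C sits halfway between the 45C and 60C entries, so the result is halfway between 30% and 50%.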

I think my older boards have basic quiet/loud/full speed settings but not detailed ramp settings; but it's been a while and they're either hidden in the basement/garage or not used often so I didn't care about noise. :D

> It seems inherently dangerous to me

In the old days, it kinda was - at least to your hardware.

Then people realized that blowing up components because a fan failed, or became unplugged, or a filter clogged with dust, maybe wasn't a great user experience and/or caused more in-warranty returns that required replacing hardware ($$expensive$$!), and implemented thermal throttling and thermal cutoffs. Nearly two decades ago at this point, I helped a friend diagnose his computer randomly turning off. It turned out that a CPU fan had unplugged itself, and the resulting overheating triggered a thermal cutoff. No other apparent harm done.

Fans aren't the only means of limiting heat: slowing stuff down and turning stuff off also works. And it turns out users sometimes would rather stuff run slow than run loud, and maybe your crappy motherboard vendor shouldn't be writing a ton of code running in kernel space - with all the potential stability and security issues that might entail - for whatever network-connected bloatware syncs your RGB lighting and fan settings to the cloud. And they will do exactly that, if that's what's required to give their customers what they want.

Just exposing fan RPMs to userspace might be far less dangerous.

> and maybe your crappy motherboard vendor shouldn't be writing a ton of code running in kernel space - with all the potential stability and security issues that might entail

This wasn't the suggestion I was making. I was suggesting that the motherboard, itself, should be controlling the fan RPMs (or should at least provide such a mode). I don't feel like taking a temperature input and mapping that to an RPM output should take much circuitry at all, but if that (somehow) required a full-blown CPU, I was thinking a (very small) auxiliary chip, dedicated to the task.

But yes, if you're going to do it on the main CPU, then in userspace. But now you incur all the problems I mentioned in the original comment, some of which can exhibit death spirals: CPU has to throttle due to heat, meaning less CPU time, meaning it will take longer to get to the code responsible for alleviating the problem of heat by notching the RPM up!

In the worst case, you hit the CPU's critical trip point before the problem can be brought under control.

On typical desktop motherboards, the Super IO chip handles all the temperature monitoring and fan control. Those chips usually have a few modes to configure some very simple control system for mapping temperature inputs to fan speed outputs (never anything as advanced as a PID controller).
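
For comparison, a PID loop done in software is not much code. The following is a rough Python sketch under stated assumptions: the `PID` class, its gains, and the 0-100% output range are hypothetical, and nothing here reflects what any Super IO chip implements.

```python
class PID:
    """Discrete PID controller that outputs a fan duty in percent."""

    def __init__(self, kp, ki, kd, out_min=0.0, out_max=100.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self._integral = 0.0
        self._prev_error = None

    def update(self, setpoint_c, measured_c, dt_s):
        # Error is positive when the system is hotter than the target.
        error = measured_c - setpoint_c
        # No derivative term on the first call, since there is no previous error yet.
        derivative = 0.0 if self._prev_error is None else (error - self._prev_error) / dt_s
        self._prev_error = error

        candidate = self._integral + error * dt_s
        raw = self.kp * error + self.ki * candidate + self.kd * derivative
        out = min(self.out_max, max(self.out_min, raw))
        # Simple anti-windup: only keep the new integral if the output was not clamped.
        if out == raw:
            self._integral = candidate
        return out
```

Calling `update(setpoint_c, measured_c, dt_s)` once per sampling interval returns the duty to apply; the gains would need tuning per system.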

The main problem is that the Super IO only has access to the temperature sensors on the motherboard itself, and on the CPU (these days, through PECI). There's no standard way for the Super IO to do out of band monitoring of temperatures on your GPU or storage drives, so if you want those to affect fan speeds you need to implement it in software.

Servers typically have BMCs controlling fans, and even Apple's x86 machines have their SMC; in both cases you typically see a more thorough monitoring of component temperatures, configured out of the box with a proper awareness of which fans are blowing across which components. But that stuff doesn't trickle down to the build your own desktop market.

Hmm. I guess I have no idea how this chip interfaces with the kernel then? The only knobs that seem to be documented to exist ever are direct RPM controls for manual control.

I know it's possible (despite the insistence to the contrary on the other thread), as basically every laptop does it. Only desktops seem to struggle with this concept.

How fan control is configured depends on the SuperIO chip, so it's different for eg. Fintek SuperIOs than for Winbond/Nuvoton SuperIOs.

On Linux, a supported SuperIO will be exposed as a directory under /sys/class/hwmon. On one of my systems, the SuperIO is a Nuvoton NCT6791, so the relevant driver documentation is https://www.kernel.org/doc/Documentation/hwmon/nct6775

Relevant sysfs files to note are pwm[1-7]_mode to toggle between DC voltage and PWM control, and pwm[1-7]_enable to switch between full speed, pure software speed control, and several Nuvoton-specific automatic speed control modes.
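
As a rough sketch of how those files can be inspected, the Python below walks a hwmon directory and collects each device's `name` plus any `pwm*` files. It only reads; the function name `list_pwm_controls` is made up for this example, and which files exist (and what their values mean) depends on the driver, so check the driver documentation before writing to any of them.

```python
from pathlib import Path


def list_pwm_controls(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon_dir: {"name": chip, "pwm": {file: value}}} for devices with PWM files."""
    found = {}
    for dev in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = dev / "name"
        if not name_file.is_file():
            continue
        pwm = {}
        # Matches pwm1, pwm1_enable, pwm1_mode, and any other pwmN* attributes.
        for f in sorted(dev.glob("pwm[0-9]*")):
            try:
                pwm[f.name] = f.read_text().strip()
            except OSError:
                # Some attributes may be unreadable; skip them.
                continue
        if pwm:
            found[dev.name] = {"name": name_file.read_text().strip(), "pwm": pwm}
    return found
```

Calling `list_pwm_controls()` with no arguments reads the real sysfs tree; passing another directory makes it easy to try against a copy.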

> I was thinking a (very small) auxiliary chip, dedicated to the task.

Extra cost (both in design and manufacture), new potential point of failure (both for manufacture and in the field), when the CPU failing was already a single point of failure for the machine. It's not that a full-blown CPU is required, it's that you already have a full-blown CPU that can do the job, simplifying the design. Well, you might see dedicated fan control hardware for server motherboards and other more industrial focused applications, but they often need to coordinate pumps/fans for an entire building - reliability in this context is more about redundancy, and alerting maintenance to the need for repairs, or perhaps switching to a new primary datacenter if the failure is big enough - not in attempting the impossible of ensuring 100% reliability for any individual component.

> CPU death spirals

Have you actually seen one of these that noticeably bogs down the fan-driving firmware/drivers and causes issues? I haven't. I've had fan failures. I've had plenty of hardware controlled fans go full apeshit 100% power to the point of being not merely a nuisance, but a problem (audio, vibration, wear+tear, ...). I've heard of building cooling failures. But I don't think I've seen so much as a blog post about the CPU getting so starved that it can't spin up the CPU fans.

And I've had fans not working hard enough - but I'd rather flip a setting in software than open up a case and go hunting for the right jumper, typically. Less disruptive - and the machine is typically usable enough I can still download/install missing software, and google appropriate documentation, which is frequently a lot more difficult to do with the case open.

---

I guess my main point here is that reliable hardware must already assume the potential for cooling failures, and that extra hardware or engineering for a minor improvement to a "purely theoretical" failure mode doesn't sound like it'd pay for itself.

100% have hardware temperature throttles and cutoffs though. Those cut in for a lot of very real failure modes that I've not only heard of actually happening, but personally experienced. Those will pay for themselves.

> Extra cost (both in design and manufacture), new potential point of failure (both for manufacture and in the field)

"But the BOM" is a tired trope in any discussion about "why can't it do X?". My money would be where my mouth is. But HW manufacturers routinely make design decisions that are simple negatives.

Motherboard manufacturers, in particular. Mobo marketing is focused on colors and flashy effects instead of on quality of production and actual function, beyond a bare minimum of tech specification. AFAICT the products are indistinguishable as no attempt is made to stand out.

> when the CPU failing was already a single point of failure for the machine

… and so is the RAM, the disk, the GPU, the northbridge, the southbridge, the eastbridge…

That's not a valid rationale for "let's adopt a fundamentally flawed design".

> Have you actually seen one of these that noticeably bogs down the fan-driving firmware/drivers and causes issues?

"Fan-driving firmware" is essentially the suggestion. It's what I haven't got.

> I've had fan failures. I've had plenty of hardware controlled fans go full apeshit 100% power to the point of being not merely a nuisance, but a problem (audio, vibration, wear+tear, ...).

No; in my current rig's 10-year lifespan, the fans have been run continuously at 100%, and no fan failures.

The noise, of course, is a nuisance. That's why I'm looking for something that can control the fans in response to temperature, such that the fans can be driven at a temperature-appropriate RPM.

But without doing that in userspace, where the controller might just "die" or effectively die, for any number of reasons, outlined earlier.

> But I don't think I've seen so much as a blog post about the CPU getting so starved that it can't spin up the CPU fans.

I couldn't find any posts reasoning about anything. Either it's not a problem, or just nobody is thinking.

> and that extra hardware or engineering for a minor improvement to a "purely theoretical" failure mode doesn't sound like it'd pay for itself.

Yeah, I was just trying to learn before attempting, literally, "IDK, try it and see if you trip the critical temp cutoff."

Yeah, no doubt the various hardware safeties will save you (although it might put some undue stress on stuff to be at >100℃…) but it might also leave you with a rather hard-to-debug situation if you start experiencing stalls or shutdowns.

… and it's only in desktops that this problem seems to exist. On every laptop I've owned, fan control is in response to temp. (And not handled in userspace, although I presume I could download any of the various fan control programs and have it be that way, but the point is that the HW is, out of the box, doing the sane thing.)

Yup, I got random thermal shutoffs. The pump in the water cooling loop had failed.

you might want to use different algorithms for automatic control than those exposed through the motherboard firmware, which are pretty much limited to constant, manual setpoints (usually with extrapolation), and a linear temp-to-rpm curve (ideally with different possible sources).

if you look at my linked Counterforce project, i believe there to be value beyond these simplest management schemes. we'll see.