by jonatron 480 days ago
At a previous company, devops would regularly find CPU fan failures on Hetzner. That's in addition to the usual expected HD/SSD failures. You've got to do your own monitoring, it's one of the reasons why unmanaged servers are cheaper than cloud instances.

I regularly find broken thermal solutions in Azure, and when I worked at Google it was also a low-level but constant irritant. When I joined Dropbox I said to my team on my first day that I could find a machine in their fleet running at 400MHz, and I was right: a bogus redundant PSU controller was asserting PROCHOT. These things happen whenever you have a lot of machines.
The term PROCHOT just brought me back to vivid memories of debugging exactly that at Facebook a while ago.

It was very non-obvious to debug, since almost all emitted metrics, apart from mysterious errors/timeouts to our service, looked reasonable. Even the CPU usage and CPU temperature graphs looked normal, since it was a bogus PROCHOT and not real thermal throttling.

And it brought me back to memories of debugging that on my friend's laptop.

It kept dropping to 400MHz. I suspected throttling, so we got it cleaned, the thermal paste replaced and all that.

Still throttled. We replaced Windows with Linux, since that was at least a bit more usable.

At the time I didn't know about PROCHOT. And my googling skills clearly weren't sufficient.

One fine day during lunch at a place on campus, having recently read about BD_PROCHOT, I wrote a script to msrprobe or whatever it was and disabled it. "Extended" the lifespan of the thing.
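
For anyone curious what such a script might look like, here is a minimal sketch, not the commenter's actual script. It assumes Linux with the `msr` kernel module loaded (which exposes `/dev/cpu/N/msr`), root access, and the commonly cited layout where bit 0 of MSR 0x1FC enables BD_PROCHOT on many Intel CPUs. Check the documentation for your specific CPU before writing anything; the function names are made up for illustration.

```python
import os
import struct

# Assumption: commonly cited register and bit for Intel CPUs; verify for your model.
MSR_POWER_CTL = 0x1FC
BD_PROCHOT_BIT = 1 << 0


def bd_prochot_enabled(value: int) -> bool:
    """Return True if the BD_PROCHOT enable bit is set in a raw MSR value."""
    return bool(value & BD_PROCHOT_BIT)


def clear_bd_prochot(value: int) -> int:
    """Return the MSR value with the BD_PROCHOT enable bit cleared."""
    return value & ~BD_PROCHOT_BIT


def read_msr(cpu: int, reg: int) -> int:
    """Read a 64-bit MSR through the Linux msr driver (needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The register number is used as the file offset.
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def write_msr(cpu: int, reg: int, value: int) -> None:
    """Write a 64-bit MSR through the Linux msr driver (needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), reg)
    finally:
        os.close(fd)
```

Usage would be roughly `write_msr(0, MSR_POWER_CTL, clear_bd_prochot(read_msr(0, MSR_POWER_CTL)))`. Note that this only stops the CPU from honoring an externally asserted PROCHOT; it does not fix whatever component is asserting it, and the setting is typically lost on reboot.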

I once had a Dell laptop that, after about three years, started complaining on power-up that I wasn't using a genuine Dell PSU and should switch to one. I ignored it at first because you could just hit enter and carry on, but after a while I noticed that every time this happened, the CPU would clock at a fixed 800MHz. I ordered a new power brick but the message didn't go away, so I returned the brick and decided to never buy Dell again.
A laptop that I had would assert PROCHOT if it didn't like the power supply you plugged into it. It actually took an embarrassing amount of time for me to notice that this is what was causing Slack to be inexplicably slower at my desk than when I was out working in a common area in the building.
In my (limited) experience this only happened with GIGABYTE servers.

Very weird behavior. I'd prefer my servers to crash instead of lowering frequency to 400MHz.

I've seen it on nearly every brand. I have some Lenovo servers in the basement that also down-clock if both PSUs aren't installed.

I have alerts on PSUs and frequency for this reason.
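
A frequency check along those lines could be sketched like this. It is a rough sketch, not the commenter's setup: it assumes Linux with the cpufreq sysfs interface (`scaling_cur_freq`, reported in kHz), the 500MHz floor is an arbitrary illustrative threshold, and the function names are hypothetical. PSU state would need a separate check (for example via the BMC), which is not shown.

```python
from pathlib import Path

# Standard cpufreq sysfs layout on Linux: cpuN/cpufreq/scaling_cur_freq (kHz).
CPUFREQ_GLOB = "cpu[0-9]*/cpufreq/scaling_cur_freq"


def read_cur_freqs_khz(root: str = "/sys/devices/system/cpu") -> dict:
    """Map CPU name (e.g. 'cpu0') to its current frequency in kHz."""
    freqs = {}
    for path in Path(root).glob(CPUFREQ_GLOB):
        cpu = path.parts[-3]  # .../cpu0/cpufreq/scaling_cur_freq -> 'cpu0'
        try:
            freqs[cpu] = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # skip CPUs whose file is unreadable or malformed
    return freqs


def stuck_low(freqs_khz: dict, floor_khz: int = 500_000) -> list:
    """Return the CPUs at or below the floor (illustrative threshold)."""
    return sorted(cpu for cpu, khz in freqs_khz.items() if khz <= floor_khz)
```

Since idle cores can legitimately clock down, a real alert would probably only fire if `stuck_low(read_cur_freqs_khz())` stays non-empty across several samples, ideally while the machine is under load.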

The servers are so cheap that overcommitting them by double is still significantly cheaper than using cloud hosting, which tends to have the same issue, only monitoring it is harder. Most people using cloud seem happy not to know, though, and it's been a known thing that there's a 5x variation between instances of the same size on AWS: https://www.brendangregg.com/Slides/AWSreInvent2017_performa...

> I'd prefer my servers to crash instead of lowering frequency to 400MHz.

100% agreed. There is nothing worse than a slow server in your fleet. This behavior reeks of "pet" thinking.

Stuff like this just comes up from time to time as soon as you run a four digit and up number of systems.
No? Maybe you cloud kids don't know how this stuff works, but unmanaged just means you get silicon-level access and remote KVM.

It's still the hosting company's responsibility to competently own, maintain, and repair the physical hardware. That includes monitoring. In the old days you had to run a script or install a package to hook into their monitoring... but with IPMI et al. being standard they don't need anything from you to do their job.

The only time a hosting company should be hands-off is when they're just providing rack space, power, and data. Anything beyond that is between you and them in a contract/agreement.

Every time I hear Hetzner come up in the last few years it's been a story about them being incompetent. If they're not detecting things like CPU fan failures of their own hardware and they deployed new systems without properly testing them first, then that's just further evidence they're still slipping.

> No? Maybe you cloud kids don't know how this stuff works, but unmanaged just means you get silicon-level access and remote KVM.

That's one way it can work. There are a great many hosted server options out there from fully managed to fully unmanaged with price points to match. Selling a cheap server under the conditions "call us when it breaks" is a perfectly reasonable offering.

Alright, let's say the hosting company has an out-of-band mechanism for detecting reboots. How do they know if the reboots are abnormal (like in this case) or normal, customer-ordered reboots after software upgrades?
Probably covered here:

> In the old days you had to run a script or install a package to hook into their monitoring....but with IPMI et al being standard they don't need anything from you to do their job

How can IPMI detect the cause of a restart (kernel panic vs. user command)?
Do Hetzner servers even run IPMI?

For dedicated servers, you have to schedule KVM access in advance, so I assume they need to move some hardware and plug it into your server.

This would mean that IPMI is most likely not available or disabled.

Not anymore, but you can abuse pstore to see the last kernel messages from before the reboot.
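
A minimal sketch of that idea, assuming Linux with a pstore backend configured (EFI variables, ramoops, etc.) and pstore mounted at `/sys/fs/pstore`. The kernel saves oops/panic logs there as `dmesg-*` records, so finding such records after boot hints that the previous boot did not end cleanly. The function names are hypothetical, reading the directory usually needs root, and some distros move the records elsewhere at boot (systemd-pstore, for example), in which case you would point `pstore_dir` there instead.

```python
from pathlib import Path


def crash_records(pstore_dir: str = "/sys/fs/pstore") -> list:
    """List kernel crash-log records (dmesg-*) left in pstore, if any."""
    d = Path(pstore_dir)
    if not d.is_dir():
        return []  # pstore not mounted or not available on this machine
    return sorted(p.name for p in d.iterdir() if p.name.startswith("dmesg-"))


def last_boot_looks_like_crash(pstore_dir: str = "/sys/fs/pstore") -> bool:
    """Heuristic: leftover dmesg-* records suggest a panic, not a clean reboot."""
    return bool(crash_records(pstore_dir))
```

This is only a heuristic from inside the OS: an empty result does not prove the reboot was user-initiated (a hard power loss leaves nothing behind), and old records linger until something deletes them.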
I'm heavily against both relying on free dependencies and going for the cheapest option.

If you can't put yourself in the seller's shoes for a second when evaluating a purchase, and you just braindead try to make cost go lower and income go higher, you're ngmi except in shady sales businesses.

Server hardware is incredibly cheap. If you are a somewhat competent programmer, you can handle most programs on a single server or even a virtual machine. Just give them a little bit of margin and pay $50/mo instead of $25/mo. It's not even enough to guarantee they won't go broke or to make you a valuable customer; you'll still be banking on whales to make the whole thing profitable.

Also, if your business is in the US, find a US host ffs.