by jesse_cureton 1313 days ago
Preface: my knowledge here is on ARM, particularly baremetal, but also embedded Linux. No idea about Windows or x86.

Generally there’s a hardware watchdog implemented as a counter/timer in the processor. It can have a predefined or configurable period. It counts down, and if it times out then it initiates a hardware reset of the processor.

You can ensure your software/OS is always at least executing code by having a task (in-kernel on Linux, or an RTOS task, or just in your main event loop on baremetal) that resets that timer. Then, if your code stops resetting that timer, it expires and resets the processor.
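To make the countdown-and-kick idea concrete, here is a toy software model of it. This is only a sketch: the `Watchdog` class, the `run` helper, and the tick-based timing are all made up for illustration, and a real watchdog is a hardware peripheral driven by its own clock.

```python
class Watchdog:
    """Toy model of a hardware countdown watchdog (hypothetical, for illustration)."""

    def __init__(self, period_ticks):
        self.period = period_ticks
        self.counter = period_ticks
        self.reset_fired = False

    def kick(self):
        # Reloading the counter is what "resetting the timer" means here.
        self.counter = self.period

    def tick(self):
        # One step of the countdown; reaching zero stands in for a processor reset.
        if self.reset_fired:
            return
        self.counter -= 1
        if self.counter <= 0:
            self.reset_fired = True


def run(loop_iterations, hang_at=None, period_ticks=3):
    """Simulate a main loop that kicks the watchdog once per iteration.

    If hang_at is set, the loop stops kicking from that iteration on,
    which models code that has locked up. Returns the iteration at which
    the watchdog fired, or None if it never did.
    """
    wdt = Watchdog(period_ticks)
    for i in range(loop_iterations):
        if hang_at is None or i < hang_at:
            wdt.kick()
        wdt.tick()
        if wdt.reset_fired:
            return i
    return None
```

With a healthy loop, `run(10)` never sees a reset. Once the kicks stop, the counter runs down and the simulated reset fires a couple of ticks later.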


A more specialised variant that's also quite common is the "window watchdog" peripheral, which is similar to the timer version, but will also trigger a reset if the keep-alive signal arrives too early, as well as too late.

It can be useful where you've got a mainloop doing some very predictably timed activities, and allows detection of faults which cause your watchdog servicing to occur too frequently.

I think it's quite common in DSP and things like motor control, where you often have hard realtime requirements and things happening too soon is just as bad as too late.
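A rough sketch of the window check, assuming the window is described by two numbers: the earliest allowed gap between kicks and the latest. The function name, parameters, and time units are hypothetical and not taken from any particular part.

```python
def first_window_fault(kick_times, window_open, timeout):
    """Return (time, reason) for the first reset, or None if every kick is in the window.

    kick_times  -- increasing timestamps at which the software kicks the watchdog
    window_open -- kicks arriving sooner than this after the previous one are "early"
    timeout     -- if no kick arrives within this long, the watchdog expires ("late")
    """
    last = 0
    for t in kick_times:
        elapsed = t - last
        if elapsed < window_open:
            # Kicked before the window opened: reset happens at the kick itself.
            return (t, "early")
        if elapsed > timeout:
            # No kick in time: reset happened when the timer ran out.
            return (last + timeout, "late")
        last = t
    return None
```

For a loop that is supposed to kick every 10 time units, a window of 8 to 12 catches both a loop that stalls and one that has started spinning too fast.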

Is there some way of accessing this from user space on Linux?
Yes! There’s an ioctl interface for managing the watchdog, and a character device at /dev/watchdog. The kernel docs[1] are a decent jumping off point to learn more.

Upon reading these I did realize on Linux it’s implemented as a kernel device, but it’s usually a userspace task that has to notify the kernel watchdog interface to actually kick the timer. This makes sense, since userspace being functional is probably what you really care about.

[1] https://www.kernel.org/doc/html/latest/watchdog/watchdog-api...
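As a minimal sketch of the userspace side: besides the ioctls, the kernel docs describe that simply writing data to the device counts as a keep-alive, and that drivers supporting "magic close" disarm the watchdog if the character `V` is written just before the file is closed. The helper below is hypothetical (the name `feed_watchdog` and its parameters are mine) and takes any writable binary file object, so it can be tried without touching a real device.

```python
import time


def feed_watchdog(dev, interval_s, kicks, sleep=time.sleep):
    """Write a keep-alive byte to `dev` `kicks` times, then attempt a magic close.

    dev        -- writable binary file object, e.g. /dev/watchdog opened unbuffered
    interval_s -- pause between kicks; must be shorter than the watchdog timeout
    sleep      -- injectable for testing (assumption: defaults to time.sleep)
    """
    for _ in range(kicks):
        dev.write(b"\0")   # any write counts as a keep-alive
        dev.flush()
        sleep(interval_s)
    # Magic close: only honoured if the driver supports it and the kernel
    # was not configured with CONFIG_WATCHDOG_NOWAYOUT.
    dev.write(b"V")
    dev.flush()


# Hypothetical usage on a real system (needs root; opening the device arms the timer):
#
#     with open("/dev/watchdog", "wb", buffering=0) as dev:
#         feed_watchdog(dev, interval_s=10, kicks=6)
```

If the process dies or hangs between writes, nothing sends the `V`, the kicks stop, and the hardware resets the machine once the timeout runs out.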

One of the perennial problems of such hardware watchdogs was the inability to tell whether a sudden reboot was due to a reset button, a hardware safety mechanism (e.g. temperature), an ECC problem, a (micro) loss of power, or the HW watchdog.

On most IPMI-capable BIOS/firmware there's now (there has been for 10 years, but I'm old) an option to log 'system' events (IPMI failures like fan speeds if you've set thresholds, but also reboot reasons). It's called the System Event Log. Very useful.

And on IPMI-plugged watchdogs, you can also see the state of the HW watchdog (is it running, how many seconds are left). Very useful too.

In addition to those already mentioned, one way is to enable it in systemd:

  # /etc/systemd/system.conf.d/foobar.conf
  [Manager]
  RuntimeWatchdogSec=60
When used in this manner, if systemd fails to ping the watchdog for 60 seconds, the system resets.

https://www.freedesktop.org/software/systemd/man/systemd-sys...

Somewhat related, nowadays by default systemd enables a 10-minute watchdog just before a regular reboot (i.e. after everything has been shut down) to ensure the reboot happens even if there is a hang for some kernel/HW reason.

https://www.kernel.org/doc/html/latest/watchdog/watchdog-api...

Many x86 systems have a built-in hardware watchdog.

There is a caveat here: it won't stop your app from crashing before the watchdog is activated. Some CPUs have a fuse that can enable the watchdog before any code starts running, but the ARMs I played with (STM32) don't appear to have that option.
At least some STM32s do: see page 89 of the STM32F4xx reference manual[1]. The option bits 5:7 at 0x1fffc000 let you activate the hardware watchdog immediately following reset if you wish.

[1] https://www.st.com/resource/en/reference_manual/rm0090-stm32...