by pikahumu 1056 days ago
The reasonable way to notice is to have alerts for any unexpected restarts. Relying on noticing intermittent service disruption is bound to fail. And so is "remembering to check for this":

> in the future I'm going to want to remember to check for this

Whenever you catch yourself thinking that sentence, treat it as a red flag and rethink your approach. You will forget. And if not you, then somebody else on your team. You need automation for things you can forget; otherwise your mental checklists grow too large to handle and become a distraction.

7 comments

> The reasonable way to notice is to have alerts for any unexpected restarts. Relying on noticing intermittent service disruption is bound to fail.

I'd argue that unexpected restarts should alert only beyond a threshold. Alerting on every occurrence is too noisy. If an individual unit failure causes a service disruption, architecture improvements are needed.
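
To make the threshold idea concrete, here is a minimal sketch of a sliding-window check. The class name, the `threshold` and `window` parameters, and the choice of a time window at all are assumptions for illustration, not something any particular monitoring tool prescribes.

```python
from collections import deque


class RestartAlerter:
    """Signal an alert only when more than `threshold` restarts fall inside `window` seconds."""

    def __init__(self, threshold, window):
        self.threshold = threshold  # hypothetical: tolerated restarts per window
        self.window = window        # hypothetical: window length in seconds
        self._events = deque()      # timestamps of recent restarts, oldest first

    def record(self, timestamp):
        """Record one restart; return True if the threshold is now exceeded."""
        self._events.append(timestamp)
        # Forget restarts that have aged out of the window.
        while self._events and self._events[0] <= timestamp - self.window:
            self._events.popleft()
        return len(self._events) > self.threshold
```

With `RestartAlerter(threshold=2, window=60)`, the first two restarts in a minute stay quiet and the third one alerts; a lone restart a few minutes later stays quiet again.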

Don't you want to know if something restarts unexpectedly? It's a bug that should be understood and fixed. (If it's not a bug then it's not unexpected.)

This is good advice imho.

For important things that can't be automated, keep a checklist and tick items off with a pen. Add $this to the list.

Atul Gawande is worth reading in general and on this topic. [1] He turned it into a book, which I haven't yet read.

[1] https://www.newyorker.com/magazine/2007/12/10/the-checklist

> The reasonable way to notice is to have alerts for any unexpected restarts.

Yes, and to have alerts, at the end of the day I have resorted to writing my own small tools [0][1]. Railgun proved to be very useful, but for smaller things, I'm writing the second one.

[0]: https://git.sr.ht/~bayindirh/railgun

[1]: https://sr.ht/~bayindirh/nudge/

On the topic of alerts/notifications, does anyone know of a project or method for minimal self-hosted cross-device notifications? Say I run a daemon on my home server which listens on a Unix or inet socket for generic JSON messages that could come from anywhere, like an IRC plugin, a bash command wrapper, or anything else I want to get a notification for. (I chose JSON over sockets so that it is quick to implement and can be easily embedded in whatever random environment I can expose the socket to, even sandboxed ones with no internet connection.) It then exposes another socket from which a daemon on my desktop and laptop can receive notifications and convert them into notify-send commands, so that they are displayed on the desktop. And since the daemon on the home server saves the messages in, say, an sqlite database, my desktop/laptop can fetch messages and print a backlog if they were offline. Otherwise, I'll probably write something like this because I'd find it useful. Email seems to be the closest thing, but I don't want to rely on an external service or run a whole internal mail server.
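
For what it's worth, the server half of that idea is small enough to sketch with the Python standard library. This is only a rough outline under stated assumptions: newline-delimited JSON as the wire format, the table layout, the function names, and the `db_path` attribute set on the server are all made up for illustration and do not come from any existing project.

```python
import json
import socketserver
import sqlite3
import time


def open_store(path="notifications.db"):
    """Open (and create if needed) the sqlite backlog. The schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, received REAL, body TEXT)"
    )
    return db


def store_message(db, raw_line):
    """Validate one JSON line and persist it; return the new row id."""
    message = json.loads(raw_line)  # raises ValueError on malformed input
    cur = db.execute(
        "INSERT INTO messages (received, body) VALUES (?, ?)",
        (time.time(), json.dumps(message)),
    )
    db.commit()
    return cur.lastrowid


def backlog(db, after_id=0):
    """Return (id, message) pairs newer than after_id, oldest first."""
    rows = db.execute(
        "SELECT id, body FROM messages WHERE id > ? ORDER BY id", (after_id,)
    )
    return [(row_id, json.loads(body)) for row_id, body in rows]


class IngestHandler(socketserver.StreamRequestHandler):
    """Accept newline-delimited JSON from any producer and store each message."""

    def handle(self):
        db = open_store(self.server.db_path)
        for line in self.rfile:
            try:
                store_message(db, line.decode("utf-8"))
            except ValueError:
                continue  # skip malformed lines instead of dropping the connection
        db.close()


def serve(host="127.0.0.1", port=8765, db_path="notifications.db"):
    """Run the ingest socket. Host and port are placeholders;
    socketserver.UnixStreamServer would cover the Unix socket case."""
    with socketserver.TCPServer((host, port), IngestHandler) as server:
        server.db_path = db_path
        server.serve_forever()
```

The desktop or laptop side would remember the last id it has seen, ask for `backlog(db, after_id=last_seen)` over the second socket, and hand each message to notify-send. That part is left out here.
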
In a Kubernetes environment, systemd auto-restarting a process inside a container can hide problems, like the article says: if it successfully restarts the process before a liveness probe can pick it up, you could easily miss this even if you monitor something like container restarts.
Are you talking about running systemd inside a container? Feels like an unnecessary layer.
What is the best way to set up those alerts? Specifically how do you set something up that knows if unexpected restarts happened? Is there a dbus or similar event you can listen for?
Worth looking into Prometheus. In its basic form, that would mean gathering metrics from https://github.com/prometheus-community/systemd_exporter on your hosts and configuring alerting in Grafana or Prometheus Alertmanager to notify you when a threshold is exceeded.
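
If a full metrics stack is more than you need, a lighter option is to poll systemd directly. Reasonably recent systemd versions expose an `NRestarts` property on service units, readable with `systemctl show`. The sketch below assumes that property is available on your hosts. The function names are made up, and treating a counter that went backwards as a reset is my own assumption about how to handle that case.

```python
import subprocess


def parse_restart_count(show_output):
    """Extract NRestarts from the key=value output of `systemctl show`."""
    for line in show_output.splitlines():
        key, _, value = line.partition("=")
        if key == "NRestarts":
            return int(value)
    raise ValueError("NRestarts not present in output")


def read_restart_count(unit):
    """Ask systemd how many automatic restarts it has performed for `unit`."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=NRestarts"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_restart_count(result.stdout)


def new_restarts(previous, current):
    """Restarts since the last poll.

    Assumption: if the counter went backwards, it was reset, so every
    restart counted since the reset is new.
    """
    return current - previous if current >= previous else current
```

A cron job or timer could call `read_restart_count` for each unit, keep the previous value in a file, and send a notification whenever `new_restarts` is above zero (or above whatever threshold you prefer).
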
But if you have too many alerts, you have to remember to check those, and the temptation is to skim over them, and then you'll inevitably miss stuff. Then you need an alert system to alert you to the really important alerts.