by johnsbrayton 1038 days ago
Great article. While it mentions monitoring, it took me a long time to appreciate how beneficial it is to do monitoring really well. Things like:

• Knowing when disk space, inode usage, or memory usage get high, long before it’s an emergency.

• Automated monitoring of SSL certificate expiration dates, letting you know days before a certificate expires. Whether or not you use something like certbot, have a separate process that automatically tells you a certificate is close to expiration.

• Automated periodic end-to-end testing of moving parts. Like if you run an email server, a process that sends something from your server to a gmail.com address, and then checks the gmail.com inbox to find the message.

• Automated periodic testing that unexposed ports remain unavailable from outside the device or private network.

• Automated checking that a Linux instance is successfully checking for and installing security updates, and is not waiting for a reboot.

• Automated checking that backups are working as expected. You might not be able to automate periodic restore testing, but at least check that backups do not appear to be silently failing.

• Separating out low priority alerts from high priority alerts. You want to get woken up when necessary, but not for an issue that can wait until you are at your desk.
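To make the first and last bullets concrete, here is a minimal sketch of a disk and inode check that maps usage to a low or high priority. The thresholds (80% and 95%) and all function names are my own assumptions for illustration, not anything from the article; `os.statvfs` is Unix-only.

```python
import os
import shutil

# Hypothetical thresholds (percent used); tune these for your own hosts.
WARN_PERCENT = 80.0  # low priority: can wait until you are at your desk
CRIT_PERCENT = 95.0  # high priority: worth getting woken up for


def classify(percent_used, warn=WARN_PERCENT, crit=CRIT_PERCENT):
    """Map a usage percentage to 'ok', 'low', or 'high' priority."""
    if percent_used >= crit:
        return "high"
    if percent_used >= warn:
        return "low"
    return "ok"


def disk_percent_used(path="/"):
    """Percentage of disk space used on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def inode_percent_used(path="/"):
    """Percentage of inodes used; 0.0 if the filesystem reports no inode count."""
    st = os.statvfs(path)  # Unix only
    if st.f_files == 0:
        return 0.0
    return 100.0 * (st.f_files - st.f_ffree) / st.f_files


def check_filesystem(path="/"):
    """Return a dict of priority labels for disk and inode usage on path."""
    return {
        "disk": classify(disk_percent_used(path)),
        "inodes": classify(inode_percent_used(path)),
    }
```

Calling `check_filesystem("/")` on each mount point and alerting on anything other than "ok" gives you the early warning long before the disk is actually full.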


Aside from (and secondary to) monitoring, one thing whose benefits and ease of setup took me years to appreciate, and which I think other self-hosters commonly neglect, is setting up caching proxies and removing default internet routes early.

Benefits include:

- Security

- Ease of configuring traffic control: As long as you're not redirecting UDP (have fun lol), steering apps with HTTP or SOCKS5 forward-proxies is so much more straightforward than routing.

- Performance/efficiency (global package cache for your network!)

- Resilience (apt upgrades and docker image pulls can keep working despite your entire network being offline)
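As a rough illustration of how simple steering an app through an HTTP forward proxy can be, here is a sketch using only Python's standard library. The proxy address is a made-up placeholder, and `make_proxied_opener` is just a name I picked; substitute whatever proxy you actually run.

```python
import urllib.request

# Hypothetical internal forward proxy; replace with your own host and port.
PROXY_URL = "http://proxy.internal.example:3128"


def make_proxied_opener(proxy_url=PROXY_URL):
    """Build an opener that sends HTTP and HTTPS requests via the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Usage (needs a reachable proxy, so it is left commented out here):
#   opener = make_proxied_opener()
#   with opener.open("http://example.com/", timeout=10) as response:
#       print(response.status)
```

A host with no default route can still fetch what it needs this way, and nothing else gets out.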

My rough starting kit for a Linux-based network here would be:

- A caching, forwarding internal DNS server. If you already have an internal recursor or forwarder, great, but it's good to keep the DNS server that serves your clients separate anyway. Options: dnsmasq/unbound/technitium/coredns/powerdns/yadifa.

- Internal NTP for syncing time. May be provided by your DNS or DHCP server already. chrony is good.

- apt-cacher-ng or other caching forward HTTP proxy for your apt/dnf/pacman/apk/whathaveyou updates.

- docker-registry-server in mirror mode and set up as mirror for any docker/podman hosts you have.

Do you have any recommendations or resources you think are great for learning more about this? I think I’m right at the beginning of this journey and looking for where to start.

I wish I did. My approach is that I have a Ruby script that runs every five minutes and does a bunch of tests. The script takes a couple of minutes to execute. It connects to servers via SSH to check things out, does end-to-end tests, then writes its results to a JSON file.
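For anyone who wants a starting point, the "run checks, write JSON" step could look roughly like this. It is a Python sketch rather than the actual Ruby script, and the JSON layout (a "high" list and a "low" list of failure messages) plus every name in it are assumptions for illustration.

```python
import json
import os
import tempfile


def run_checks(checks):
    """Run each check and group failure messages by priority.

    checks is a list of (name, priority, func) tuples, where priority is
    "high" or "low" and func returns None on success or an error string.
    """
    results = {"high": [], "low": []}
    for name, priority, func in checks:
        try:
            error = func()
        except Exception as exc:  # a crashing check counts as a failure
            error = f"check raised {exc!r}"
        if error:
            results[priority].append(f"{name}: {error}")
    return results


def write_results(results, path):
    """Write results atomically so a reader never sees a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(results, handle)
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        os.unlink(tmp_path)  # clean up the temporary file on any failure
        raise
```

Writing to a temporary file and renaming it means whatever reads the file gets either the previous complete result or the new one, never a partial write.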

It runs on a Linode instance with a webapp whose sole responsibility is to respond to Pingdom requests. Pingdom checks two URLs. The first returns a 500 if the JSON file indicates an issue that warrants texting me. The second returns a 500 if the JSON file indicates a lower priority issue that warrants emailing me. Pingdom is configured accordingly.

If for any reason the JSON file has not been written in the past 10 minutes (?) or cannot be read and parsed, both URLs return a 500.
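The decision each URL makes can be sketched in a few lines. This is a Python approximation, not the real webapp: it assumes the JSON file holds "high" and "low" lists of failure messages, uses the file's modification time for the freshness test, and `status_for` is a hypothetical name.

```python
import json
import os
import time

MAX_AGE_SECONDS = 600  # the "past 10 minutes" window described above


def status_for(path, priority, max_age=MAX_AGE_SECONDS, now=None):
    """Return 200 if there are no failures for priority, else 500.

    A missing, stale, unreadable, or malformed file also yields 500, so a
    dead monitoring script triggers an alert instead of silence.
    """
    now = time.time() if now is None else now
    try:
        if now - os.path.getmtime(path) > max_age:
            return 500  # the script has not written recently
        with open(path) as handle:
            results = json.load(handle)
        failures = results[priority]
    except (OSError, ValueError, KeyError, TypeError):
        return 500  # missing file, bad JSON, or unexpected structure
    return 500 if failures else 200
```

The texting URL would return `status_for(path, "high")` and the emailing URL `status_for(path, "low")`. Failing closed is the important design choice: if the monitor itself breaks, both URLs go red.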

The script has a log file, so when I get an alert I can check the log file to determine what is wrong.

This is likely atypical, but it works really well for me. My scripts do the work of monitoring the heck out of everything. I only need Pingdom (or a service like it) to monitor two URLs and do the texting/emailing.

But my overall approach is to think of monitoring like unit tests or integration tests: when I think of something that could go wrong, I try to make sure there is monitoring that can detect it and alert me. When possible, before it becomes urgent. And when something does go wrong that is not automatically detected, it's a high priority to add monitoring around that.