Hacker News
by PhilipRoman 1057 days ago
I was building an elaborate job monitoring system, but then I realized that what I really need is to monitor the actual end-to-end functionality.

For example, instead of monitoring the Minecraft server process that OpenRC spawns, I have a dedicated monitoring server that actually queries the server for its version, number of players, etc. The same goes for websites and other services. Think of it as periodically running an integration test on a live system.
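To make the "integration test on a live system" idea concrete, here is a rough sketch of what such a probe could look like. It is not my actual script: `http_probe`, `run_check`, and `CheckResult` are names made up for illustration, and the check assumes a healthy site returns some known text in its body.

```python
import time
import urllib.request
from typing import Callable, NamedTuple


class CheckResult(NamedTuple):
    name: str
    ok: bool
    detail: str
    seconds: float


def http_probe(url: str, expected_text: str, timeout: float = 5.0) -> str:
    """Fetch a page and confirm it contains text a healthy service would return.

    Hypothetical helper: raises on any failure, returns a short detail string.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        status = resp.status
        body = resp.read().decode("utf-8", errors="replace")
    if expected_text not in body:
        raise RuntimeError(f"response did not contain {expected_text!r}")
    return f"HTTP {status}"


def run_check(name: str, probe: Callable[[], str]) -> CheckResult:
    """Run one end-to-end probe and record whether it succeeded and how long it took."""
    start = time.monotonic()
    try:
        detail = probe()
        ok = True
    except Exception as exc:
        # Any exception (timeout, refused connection, wrong content) counts as a failure.
        detail = f"{type(exc).__name__}: {exc}"
        ok = False
    return CheckResult(name, ok, detail, time.monotonic() - start)
```

Something like `run_check("website", lambda: http_probe("https://example.com/", "Example Domain"))` would then be called from cron or a loop. A game server check would have the same shape, with a probe that speaks the server's own query protocol instead of HTTP.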

This way I get much more confidence that the service is doing what it should.

I'm not a big fan of overcomplicated monitoring systems - I simply have a script that builds an HTML status page with enough information to know when something goes wrong.
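For a sense of scale, a status page builder along these lines can be very small. This is only a sketch under assumptions: the function name `build_status_page`, the `(name, ok, detail)` tuple layout, and the colors are invented for the example, not taken from a real setup.

```python
import html
from datetime import datetime, timezone


def build_status_page(results, generated_at=None):
    """Render check results as a standalone HTML page.

    `results` is a list of (name, ok, detail) tuples (an assumed layout).
    """
    generated_at = generated_at or datetime.now(timezone.utc)
    rows = []
    for name, ok, detail in results:
        status = "OK" if ok else "FAIL"
        color = "#cfc" if ok else "#fcc"
        # Escape anything that came from a service so it cannot break the page.
        rows.append(
            f'<tr style="background:{color}">'
            f"<td>{html.escape(name)}</td>"
            f"<td>{status}</td>"
            f"<td>{html.escape(detail)}</td></tr>"
        )
    return (
        "<!doctype html>\n"
        '<html><head><meta charset="utf-8"><title>Status</title></head><body>\n'
        f"<p>Generated {generated_at.isoformat()}</p>\n"
        "<table>\n"
        "<tr><th>Service</th><th>Status</th><th>Detail</th></tr>\n"
        + "\n".join(rows)
        + "\n</table>\n</body></html>\n"
    )
```

Writing the returned string to a file served by any static web server is enough; a failed check shows up as a red row with the error text next to it.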

1 comment

Broad alerts are really good to have. Narrow metrics are great to have once something goes wrong. When a server does go down, what did CPU, memory, and disk IO look like? Did the request count climb quickly before the outage? Having those other metrics helps with speedy troubleshooting -- is it a software problem that got out of control, or did some piece of hardware die or get throttled?

I'm of the opinion that having charts and graphs to rely on can focus troubleshooting resources more quickly onto the most actionable areas.