by rudasn 1048 days ago
So about 3 years ago we had a bunch of on-prem servers shutting down around March/April. We had even more servers that weren't shutting down yet, so we had to "move fast" before they all had issues.

I must have spent about a week trying to learn just enough about prometheus and grafana (I had used grafana before with influx but for a different purpose) so that we could monitor temperature, memory, cpu, and disk (the bare minimum).

The goal was to have a single dashboard showing these critical metrics for all servers (< 100), and be able to receive email or sms alerts when things turned red.

No luck. After a week I had nothing to show for it.

So I turned to Netdata. A one-liner on each server and we had a super sexy and fast dashboard for each server. No bird's-eye view, but fine. I then spent maybe 3-4 days trying to figure out how to get alerting to work (just email, but fine) and get temperature readings (or something like that).

No luck. By the end of week 2 I still had nothing but a bunch of servers shutting down during peak hours.

Week 3 I said fuck it, I'll do the stupidest thing and write my own stack: a bunch of shell scripts, deployed via ansible and managed by systemd, capturing any metric I could think of and posting to a $5/month server running a single nodejs service that does in-memory (only) averages, medians, etc., and triggers alerts (email, sms, Slack maybe soon) when things get yellow or red.
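
To make the in-memory part concrete, here is a minimal sketch of what such an aggregator could look like. It is not the actual service: the names `MetricWindow`, `Monitor`, and `classify`, the metric key `cpu_temp_c`, the window size, and the thresholds are all assumptions made up for illustration. It keeps a fixed number of recent samples per host and metric, computes the mean and median, and reports a status only when it changes, so an alert would fire once per transition rather than on every sample.

```typescript
// Hypothetical sketch: all names, sizes and thresholds here are assumptions.
type Status = "green" | "yellow" | "red";

interface Thresholds {
  yellow: number; // at or above this value the metric is "yellow"
  red: number;    // at or above this value the metric is "red"
}

// Keeps only the last `size` samples of one metric on one host, in memory.
class MetricWindow {
  private samples: number[] = [];
  private readonly size: number;

  constructor(size: number) {
    this.size = size;
  }

  push(value: number): void {
    this.samples.push(value);
    if (this.samples.length > this.size) this.samples.shift();
  }

  mean(): number {
    if (this.samples.length === 0) return NaN;
    return this.samples.reduce((a, b) => a + b, 0) / this.samples.length;
  }

  median(): number {
    if (this.samples.length === 0) return NaN;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 1
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
  }
}

function classify(value: number, t: Thresholds): Status {
  if (value >= t.red) return "red";
  if (value >= t.yellow) return "yellow";
  return "green";
}

// One window per "host/metric" key; nothing is persisted to disk.
class Monitor {
  private windows = new Map<string, MetricWindow>();
  private lastStatus = new Map<string, Status>();
  private readonly windowSize: number;
  private readonly thresholds: Record<string, Thresholds>;

  constructor(windowSize: number, thresholds: Record<string, Thresholds>) {
    this.windowSize = windowSize;
    this.thresholds = thresholds;
  }

  // Records a sample and returns the new status if it changed, else null.
  // The status is based on the window median, so one spike does not alert.
  record(host: string, metric: string, value: number): Status | null {
    const key = `${host}/${metric}`;
    let w = this.windows.get(key);
    if (!w) {
      w = new MetricWindow(this.windowSize);
      this.windows.set(key, w);
    }
    w.push(value);

    const t = this.thresholds[metric];
    if (!t) return null; // no thresholds configured for this metric

    const status = classify(w.median(), t);
    const previous = this.lastStatus.get(key) ?? "green";
    this.lastStatus.set(key, status);
    return status !== previous ? status : null;
  }
}
```

In a setup like the one described, each shell script would POST a sample to the service, the handler would call something like `monitor.record("web-1", "cpu_temp_c", 72)`, and a non-null result would be handed to whatever sends the email or sms. Classifying on the median rather than the latest value is a design choice of this sketch, not something stated in the post.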

By week 4 we had monitoring for all servers and for any metric we really needed.

Super cheap, super stable and absolutely no maintenance required. Sure, we probably can't monitor hundreds of servers or thousands of metrics, but we don't need to.

I really wanted to use something else, but I just couldn't :(

3 comments

> So I turned to Netdata. A one-liner on each server and we had a super sexy and fast dashboard for each server. No bird's-eye view, but fine. I then spent maybe 3-4 days trying to figure out how to get alerting to work (just email, but fine) and get temperature readings (or something like that).

I work at Netdata on ML. Just wanted to mention that as of the last release a parent node will show all its children in the agent dashboard, so if you were doing this again today, a parent Netdata might have gotten you the bird's-eye view as a starting point: https://github.com/netdata/netdata/releases/tag/v1.41.0#v141...

(of course we also have Netdata Cloud, which would probably have worked too, but maybe it was not as built out 3 years ago as it is now - but I don't want to go into sales mode and get blasted :) )

Hey! I subscribe to your github releases and was reading about all that the other day (the parent/child node stuff).

When/If I have the time I'll dig into Netdata some more as I like your approach. :)

I'm not a devops/sre/systems guy, I just do it because I have to, so it's a bit difficult for me to find the time to experiment with these tools.

Cool! - we're always looking for feedback, so feel free to hop into our discord, forum, or GH discussions (links here: https://www.netdata.cloud/community/) to leave feedback or ask questions if you run into any issues.

(cheers for the mention here too - always nice to try to get some feedback and discussion going on HN as it's so candid :0 )

So .... why were the servers shutting down, and what metric did your own system capture that the others didn't which let you determine that?

Well, at first I was able to gather and correlate enough cpu, temperature, and entrypoint data for the apparently problematic servers.

The servers were shutting down due to high temperatures caused by persistent high cpu usage.

Knowing that, I installed datadog with APM on just a couple of the servers (because $$) which led me to postgres issues (indexing), weasy pdf generation issues (a python lib), and some bad django code (queryset to list before pagination).

If you have a one-off server running nodejs, you've definitely got maintenance

Why's that?

I think the only time I ssh'd into that server was last week, when I added usb device monitoring and had to docker pull && docker up -d.

Other than that... Can't remember dealing with the "monitoring stack".