
by rudasn 1095 days ago

So about 3 years ago we had a bunch of on-prem servers shutting down around March/April. We had even more servers that weren't shutting down yet, so we had to "move fast" before they all had issues.

I must have spent about a week trying to learn just enough about prometheus and grafana (I had used grafana before with influx but for a different purpose) so that we could monitor temperature, memory, cpu, and disk (the bare minimum).

The goal was to have a single dashboard showing these critical metrics for all servers (< 100), and be able to receive email or sms alerts when things turned red.

No luck. After a week I had nothing to show for it.

So I turned to Netdata. A one-liner on each server and we had a super sexy and fast dashboard for each server. No bird's-eye view, but fine. I then spent maybe 3-4 days trying to figure out how to get alerting to work (just email, but fine) and get temperature readings (or something like that).

No luck. By the end of week 2 I still had nothing but a bunch of servers shutting down during peak hours.

Week 3 I said fuck it, I'll do the stupidest thing and write my own stack: a bunch of shell scripts, deployed via ansible and managed by systemd, capturing any metric I could think of and posting to a $5/month server running a single nodejs service that would do in-memory (only) averages, medians etc., and trigger alerts (email, sms, Slack maybe soon) when things get yellow or red.
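For readers wondering what such a service boils down to, here is a minimal sketch of the in-memory part. It is not the poster's code: the original was a nodejs service, while this uses Python for illustration, and the `Aggregator` class, the metric name `cpu_temp_c`, the window size, and the threshold numbers are all assumptions made up for the example.

```python
from collections import defaultdict, deque
from statistics import mean, median

# Hypothetical thresholds; the post does not give any actual numbers.
THRESHOLDS = {"cpu_temp_c": {"yellow": 70.0, "red": 85.0}}
WINDOW = 5  # recent samples kept per (host, metric); an assumption


class Aggregator:
    """Keeps a short rolling window of samples per host and metric, in memory only."""

    def __init__(self, thresholds, window=WINDOW):
        self.thresholds = thresholds
        # deque(maxlen=...) silently drops the oldest sample once the window is full.
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, host, metric, value):
        """Store one reading and return the alert level for the rolling mean."""
        series = self.samples[(host, metric)]
        series.append(float(value))
        return self.level(metric, mean(series))

    def level(self, metric, avg):
        """Map a rolling mean to 'green', 'yellow', or 'red'."""
        limits = self.thresholds.get(metric)
        if limits is None:
            return "green"  # no thresholds configured for this metric
        if avg >= limits["red"]:
            return "red"
        if avg >= limits["yellow"]:
            return "yellow"
        return "green"

    def summary(self, host, metric):
        """Return mean and median of the current window, or None if no data yet."""
        series = self.samples.get((host, metric))
        if not series:
            return None
        return {"mean": mean(series), "median": median(series)}
```

Each shell script would then only need to POST a host, a metric name, and a value; anything that comes back yellow or red is what gets handed to the email/sms sender. Alerting on the rolling mean rather than on single readings is one way to avoid paging on a momentary spike.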

By week 4 we had monitoring for all servers and for any metric we really needed.

Super cheap, super stable and absolutely no maintenance required. Sure, we probably can't monitor hundreds of servers or thousands of metrics, but we don't need to.

I really wanted to use something else, but I just couldn't :(

3 comments

andrewm4894 1095 days ago

> So I turned to Netdata. A one-liner on each server and we had a super sexy and fast dashboard for each server. No bird's-eye view, but fine. I then spent maybe 3-4 days trying to figure out how to get alerting to work (just email, but fine) and get temperature readings (or something like that).

I work at Netdata on ML. Just wanted to mention that as of the last release, a parent node will show all its children in the agent dashboard, so if you were doing this again today, a parent Netdata node might have given you the bird's-eye view as a starting point: https://github.com/netdata/netdata/releases/tag/v1.41.0#v141...

(Of course we also have Netdata Cloud, which would probably have worked too, but maybe it was not as built out 3 years ago as it is now. I don't want to go into sales mode and get blasted, though :) )


rudasn 1095 days ago

Hey! I subscribe to your github releases and was reading about all that the other day (the parent/child node stuff).

When/If I have the time I'll dig into Netdata some more as I like your approach. :)

I'm not a devops/sre/systems guy, I just do it because I have to, so it's a bit difficult for me to find the time to experiment with these tools.


andrewm4894 1095 days ago

Cool! We're always looking for feedback, so feel free to hop into our discord, forum, or GH discussions (links here: https://www.netdata.cloud/community/) to leave feedback or ask questions if you run into any issues.

(Cheers for the mention here too. It's always nice to try to get some feedback and discussion going on HN, as it's so candid :0 )


mike_hearn 1095 days ago

So... why were the servers shutting down, and what metric did your own system capture that the others didn't, which let you determine that?


rudasn 1095 days ago

Well, at first I was able to gather and correlate enough cpu, temperature, entrypoint data for apparently problematic servers.

The servers were shutting down due to high temperatures caused by persistent high cpu usage.

Knowing that, I installed datadog with APM on just a couple of the servers (because $$) which led me to postgres issues (indexing), weasy pdf generation issues (a python lib), and some bad django code (queryset to list before pagination).
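The last item (queryset to list before pagination) is a common pattern, so here is a rough illustration of why it hurts. The actual Django code isn't shown in the thread; `FakeQuerySet` below is a hypothetical stand-in that only counts how many rows would be fetched, and both function names are made up for the example. In real Django, slicing an unevaluated queryset becomes a LIMIT/OFFSET query, while `list(queryset)` loads every row first.

```python
class FakeQuerySet:
    """Hypothetical stand-in for a lazy queryset: counts the rows it 'fetches'."""

    def __init__(self, total_rows):
        self.total_rows = total_rows
        self.rows_fetched = 0

    def __getitem__(self, key):
        # Slicing fetches only the requested rows, like LIMIT/OFFSET in SQL.
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slicing")
        rows = list(range(self.total_rows)[key])
        self.rows_fetched += len(rows)
        return rows

    def __iter__(self):
        # Full iteration fetches every row in the table.
        self.rows_fetched += self.total_rows
        return iter(range(self.total_rows))


def page_after_list(qs, page, per_page):
    """The problematic pattern: materialize everything, then cut out one page."""
    rows = list(qs)
    start = (page - 1) * per_page
    return rows[start:start + per_page]


def page_lazily(qs, page, per_page):
    """Slice the lazy object so only one page of rows is fetched."""
    start = (page - 1) * per_page
    return qs[start:start + per_page]
```

Both functions return the same page, but on a 10,000-row table the first one touches all 10,000 rows per request and the second touches only `per_page` of them, which is the kind of difference that shows up as persistent high CPU.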


droopyEyelids 1095 days ago

If you have a one-off server running nodejs, you've definitely got maintenance.


rudasn 1095 days ago

Why's that?

I think the only time I ssh'd into that server was last week, when I added usb device monitoring and had to docker pull && docker up -d.

Other than that... Can't remember dealing with the "monitoring stack".
