by mike_hearn 1054 days ago
So... why were the servers shutting down, and what metric did your own system capture that the others didn't, which let you determine that?

Well, at first I was able to gather and correlate enough CPU, temperature, and entrypoint data for the apparently problematic servers.

The servers were shutting down due to high temperatures caused by persistently high CPU usage.

Knowing that, I installed Datadog with APM on just a couple of the servers (because $$), which led me to Postgres issues (indexing), WeasyPrint PDF generation issues (a Python lib), and some bad Django code (converting a queryset to a list before pagination).
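
To illustrate the queryset-to-list problem, here is a rough sketch. It doesn't use Django itself: `LazyRows` and `page_of` are hypothetical stand-ins I made up to mimic how a lazy queryset only fetches rows when it is sliced or iterated. The real code probably looked different, but the cost pattern should be the same.

```python
class LazyRows:
    """Hypothetical stand-in for a lazy queryset (not Django's API).

    Rows are only "fetched" when the object is sliced or iterated,
    and `fetched` records how many rows were pulled in total.
    """

    def __init__(self, total):
        self.total = total
        self.fetched = 0

    def __getitem__(self, key):
        # Slicing plays the role of LIMIT/OFFSET: only the requested rows load.
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        rows = list(range(self.total)[key])
        self.fetched += len(rows)
        return rows

    def __iter__(self):
        # Full iteration (which list(...) triggers) loads every row.
        self.fetched += self.total
        return iter(range(self.total))


def page_of(rows, page, per_page):
    """Return one 1-indexed page by slicing `rows`."""
    start = (page - 1) * per_page
    return rows[start:start + per_page]


# Bad: materialize everything first, then take one page of it.
slow = LazyRows(100_000)
page_slow = page_of(list(slow), page=3, per_page=25)

# Better: slice the lazy object so only one page is loaded.
fast = LazyRows(100_000)
page_fast = page_of(fast, page=3, per_page=25)

print(slow.fetched, fast.fetched)
```

Both calls return the same page, but the first one loads every row on each request just to throw almost all of them away, which is the kind of thing that keeps CPU pegged.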