by Cyph0n
1111 days ago
I recently set up basic monitoring using Telegraf + Influx + Grafana. Here are the alert triggers, in order of importance (imo):

* ZFS pool errors. Motivator: one of my HDDs failed and it took me a few days to notice. The pool (raidz1) kept chugging along, of course.
* HDD and SSD SMART errors
* High HDD and SSD temperatures
* ZFS pool utilization
* High CPU temperature. Motivator: one of my case fans failed and it took a while for me to notice.
* High GPU temperatures. Motivator: I have two GPUs in my tower, one of which I don't really monitor (used for transcoding).
* High (sustained) CPU usage. I track this at the server level, rather than for individual VMs.
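For anyone who wants to prototype something similar outside Grafana, here is a rough sketch of the triggers above as an ordered rule list in Python. This is not my actual setup: the metric names (`zfs_pool_errors`, `drive_temp_c`, and so on) and every threshold are made-up assumptions for illustration, and "sustained" CPU usage is assumed to arrive as an average already computed over some window upstream.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str                       # human-readable alert name
    metric: str                     # hypothetical metric key, not a real Telegraf field
    fires: Callable[[float], bool]  # returns True when the alert should trigger


# Ordered by importance, mirroring the list above.
# All thresholds are illustrative assumptions; tune them for your hardware.
RULES: List[Rule] = [
    Rule("ZFS pool errors", "zfs_pool_errors", lambda v: v > 0),
    Rule("SMART errors", "smart_errors", lambda v: v > 0),
    Rule("High drive temperature", "drive_temp_c", lambda v: v >= 50),
    Rule("ZFS pool utilization", "zfs_pool_used_pct", lambda v: v >= 80),
    Rule("High CPU temperature", "cpu_temp_c", lambda v: v >= 85),
    Rule("High GPU temperature", "gpu_temp_c", lambda v: v >= 85),
    Rule("High sustained CPU usage", "cpu_usage_pct_avg", lambda v: v >= 90),
]


def triggered_alerts(readings: Dict[str, float]) -> List[str]:
    """Return the names of rules that fire, in priority order.

    Metrics missing from `readings` are skipped rather than treated as alerts.
    """
    alerts = []
    for rule in RULES:
        value = readings.get(rule.metric)
        if value is not None and rule.fires(value):
            alerts.append(rule.name)
    return alerts


if __name__ == "__main__":
    sample = {"zfs_pool_errors": 1, "cpu_temp_c": 90, "gpu_temp_c": 60}
    print(triggered_alerts(sample))
```

Keeping the rules in one ordered list means the most important alert always comes first in the output, which matches how I would want notifications prioritized.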