> What are the key assets you monitor beyond the basics like CPU, RAM, and disk usage?

* Network is another basic that should be there
* Average disk service time
* Memory is tricky (even MemAvailable can miss important anonymous memory pageouts with a mistuned vm.swappiness), so also monitor swap page-out rates
* TCP retransmits as a warning sign of network/hardware issues
* UDP & TCP connection counts by state (for TCP: established, time_wait, etc.), broken down by incoming and outgoing
* Per-CPU utilization
* Rates of operating system warnings and errors in the kernel log
* Application average/max response time
* Application throughput (both total and broken down by error rate, e.g. HTTP response code >= 400)
* Application thread pool utilization
* Rates of application warnings and errors in the application log
* Application up/down with heartbeat
* Per-application & per-thread CPU utilization
* Periodic on-CPU sampling for a short time, rendered as a flame graph
* DNS lookup response times/errors

> Do you also keep tabs on network performance, processes, services, or other metrics?

Yes, per-process and over time; both are useful for post-mortem analysis.
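To make two of the items above concrete (swap page-out rate and TCP retransmit rate), here is a rough sketch, assuming a Linux host where the cumulative counters `pswpout` in `/proc/vmstat` and `RetransSegs` in `/proc/net/snmp` are readable. The function names and the 5-second interval are illustrative choices of mine, not anything from a particular monitoring tool; a real agent would more likely export the raw counters and let the metrics backend compute rates.

```python
import time
from pathlib import Path


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip("-").isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def parse_snmp_tcp(text):
    """Parse the two 'Tcp:' lines of /proc/net/snmp (header row, then values)."""
    rows = [line.split()[1:] for line in text.splitlines() if line.startswith("Tcp:")]
    if len(rows) < 2:
        return {}
    header, values = rows[0], rows[1]
    return {name: int(value) for name, value in zip(header, values)}


def rate(before, after, key, seconds):
    """Per-second rate of a cumulative counter between two snapshots."""
    return (after[key] - before[key]) / seconds


def sample(interval=5.0):
    """Take two snapshots `interval` seconds apart and return both rates.

    Assumption: Linux only; the /proc paths below do not exist elsewhere.
    """
    def snapshot():
        return (
            parse_vmstat(Path("/proc/vmstat").read_text()),
            parse_snmp_tcp(Path("/proc/net/snmp").read_text()),
        )

    vm0, tcp0 = snapshot()
    time.sleep(interval)
    vm1, tcp1 = snapshot()
    return {
        "swap_pageouts_per_s": rate(vm0, vm1, "pswpout", interval),
        "tcp_retrans_per_s": rate(tcp0, tcp1, "RetransSegs", interval),
    }
```

Calling `sample()` on a Linux box returns a dict with the two rates. A sustained non-zero swap page-out rate is the signal MemAvailable can miss, and a rising retransmit rate is the network/hardware warning sign mentioned above.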