We've been developing the BlueWave Uptime Manager [1] for the past 5 months with a team of 7 developers and 3 external contributors, and until now we have stayed under the radar. As we expand from basic uptime tracking to a comprehensive monitoring solution, we'd like to get insights from the community.

For those of you managing server infrastructure:

- What are the key assets you monitor beyond the basics like CPU, RAM, and disk usage?
- Do you also keep tabs on network performance, processes, services, or other metrics?

Additionally, we're debating whether to build a custom monitoring agent or leverage existing solutions like OpenTelemetry or Fluentd.

- What's your take: would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
- Lastly, what's your preference for data collection: do you prefer an agent that pulls data or one that pushes it to the monitoring system?

[1] https://github.com/bluewave-labs/bluewave-uptime
* Network is another basic that should be there
* Average disk service time
* Memory is tricky (even MemAvailable can miss important anonymous memory pageouts with a mistuned vm.swappiness), so also monitor swap page out rates
* TCP retransmits as a warning sign of network/hardware issues
* UDP & TCP connection counts by state (for TCP: established, time_wait, etc.) broken down by incoming and outgoing
* Per-CPU utilization
* Rates of operating system warnings and errors in the kernel log
* Application average/max response time
* Application throughput (both total and broken down by error status, e.g. HTTP response code >= 400)
* Application thread pool utilization
* Rates of application warnings and errors in the application log
* Application up/down with heartbeat
* Per-application & per-thread CPU utilization
* Periodic on-CPU sampling over a short window, rendered as a flame graph
* DNS lookup response times/errors
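To make two of the items above concrete (swap page-out rate and TCP retransmits), here is a minimal Python sketch, assuming a Linux host. It reads the cumulative `pswpout` counter from `/proc/vmstat` and `RetransSegs` from the `Tcp:` lines of `/proc/net/snmp`, then turns two samples into per-second rates. The function names (`parse_vmstat`, `parse_snmp_tcp`, `per_second`, `sample_rates`) are made up for illustration and are not part of any existing agent.

```python
import time
from pathlib import Path


def parse_vmstat(text: str) -> dict[str, int]:
    """Parse /proc/vmstat-style 'name value' lines into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def parse_snmp_tcp(text: str) -> dict[str, int]:
    """Parse the 'Tcp:' header line and value line from /proc/net/snmp."""
    tcp_lines = [
        line.split()[1:] for line in text.splitlines() if line.startswith("Tcp:")
    ]
    if len(tcp_lines) < 2:
        return {}
    names, values = tcp_lines[0], tcp_lines[1]
    return {name: int(value) for name, value in zip(names, values)}


def per_second(prev: int, curr: int, interval_s: float) -> float:
    """Rate of a cumulative counter; a counter reset is reported as 0."""
    if interval_s <= 0 or curr < prev:
        return 0.0
    return (curr - prev) / interval_s


def read_counters() -> dict[str, int]:
    """Read the two cumulative counters of interest (Linux only)."""
    vm = parse_vmstat(Path("/proc/vmstat").read_text())
    tcp = parse_snmp_tcp(Path("/proc/net/snmp").read_text())
    return {
        "pswpout": vm.get("pswpout", 0),          # pages swapped out
        "RetransSegs": tcp.get("RetransSegs", 0),  # TCP segments retransmitted
    }


def sample_rates(interval_s: float = 5.0) -> dict[str, float]:
    """Take two samples interval_s apart and return per-second rates."""
    before = read_counters()
    time.sleep(interval_s)
    after = read_counters()
    return {k: per_second(before[k], after[k], interval_s) for k in before}
```

Calling `sample_rates(5.0)` on a Linux box returns something like a dict with a `pswpout` rate in pages per second and a `RetransSegs` rate in segments per second. Alert thresholds are workload-specific, so none are suggested here.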
> Do you also keep tabs on network performance, processes, services, or other metrics?
Yes, per process and over time; both are useful for post-mortem analysis.
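A rough sketch of what per-process tracking over time can look like on Linux, assuming you sample `/proc/<pid>/stat` at a fixed interval and keep the deltas. The helper names (`parse_pid_stat`, `cpu_percent`, `snapshot`, `ticks_per_second`) are hypothetical; a real agent would also record memory, I/O, and process start time to tell reused PIDs apart.

```python
import os
from pathlib import Path


def parse_pid_stat(text: str) -> tuple[str, int]:
    """Return (command name, utime + stime in clock ticks) from stat text."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' rather than on whitespace.
    open_idx = text.index("(")
    close_idx = text.rindex(")")
    comm = text[open_idx + 1:close_idx]
    rest = text[close_idx + 1:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(rest[11]), int(rest[12])
    return comm, utime + stime


def cpu_percent(prev_ticks: int, curr_ticks: int,
                interval_s: float, ticks_per_s: int) -> float:
    """CPU use over the interval, where 100.0 means one fully busy core."""
    if interval_s <= 0 or curr_ticks < prev_ticks:
        return 0.0
    return 100.0 * (curr_ticks - prev_ticks) / ticks_per_s / interval_s


def ticks_per_second() -> int:
    """Kernel clock ticks per second, as used by the utime/stime fields."""
    return os.sysconf("SC_CLK_TCK")


def snapshot() -> dict[int, tuple[str, int]]:
    """One sample of cumulative CPU ticks for every visible process."""
    result = {}
    for entry in Path("/proc").iterdir():
        if not entry.name.isdigit():
            continue
        try:
            result[int(entry.name)] = parse_pid_stat((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # process exited between listing and reading
    return result
```

Taking two `snapshot()` calls a known interval apart and feeding the tick counts for each PID present in both into `cpu_percent` gives a per-process time series; storing those samples is what makes the after-the-fact analysis possible.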