by otterley 4509 days ago
Swap rate (as opposed to space consumed) is probably the #1 metric that monitoring agents fail to report.

One thing that drives me nuts is how frequently monitoring agents/dashboards report and graph only free memory on Linux, which gives misleading results. It's fine to report it, but to make sense of it, you have to stack free memory along with cached and buffered memory, if you care about what's actually available for applications to use.
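To make the stacking concrete, here is a minimal Python sketch that parses `/proc/meminfo`-style text and reports free, buffered, and cached memory alongside their sum. The function names `parse_meminfo` and `stacked_memory_kb` are made up for illustration; only the `MemFree`, `Buffers`, and `Cached` field names come from the kernel's output.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: value}.

    Most fields are reported in kB; the unit suffix is dropped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def stacked_memory_kb(fields):
    """Return the three series a dashboard would stack, plus their sum."""
    free = fields["MemFree"]
    buffers = fields["Buffers"]
    cached = fields["Cached"]
    return {
        "free": free,
        "buffers": buffers,
        "cached": cached,
        "free_plus_cache": free + buffers + cached,
    }
```

On a Linux host this could be fed with `open("/proc/meminfo").read()`. Graphing the three series stacked, rather than `free` alone, is the point being made above.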

Another often-overlooked metric that's important for web services in particular is the TCP accept queue depth, per listening port. Once the accept queue is full, remote clients will get ECONNREFUSED, which is a bad place to be. This value is somewhat difficult to obtain, though, because AFAIK Linux doesn't expose it.
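One possible source worth checking, offered as an assumption rather than anything this thread confirms: on the kernels I'm aware of, the `rx_queue` column of `/proc/net/tcp` holds the current accept backlog for sockets in the LISTEN state (state code `0A`). If that holds on your kernel, a rough Python sketch for per-port depth could look like this. The function name `listen_queue_depths` is hypothetical.

```python
def listen_queue_depths(proc_net_tcp_text):
    """Map local port -> rx_queue for rows in LISTEN state (st == 0A).

    ASSUMPTION: for listening sockets the kernel reports the accept
    backlog in rx_queue. Verify this against your kernel before
    trusting the numbers.
    """
    depths = {}
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) < 5 or cols[3] != "0A":
            continue
        # local_address is HEXIP:HEXPORT; the queue column is tx:rx in hex
        port = int(cols[1].rsplit(":", 1)[1], 16)
        rx_queue = int(cols[4].split(":")[1], 16)
        # several sockets can listen on one port; keep the deepest queue
        depths[port] = max(depths.get(port, 0), rx_queue)
    return depths
```

This only covers IPv4; `/proc/net/tcp6` uses the same column layout with longer addresses, so the same parsing should apply there.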

4 comments

> One thing that drives me nuts is how frequently monitoring agents/dashboards report and graph only free memory on Linux, which gives misleading results. It's fine to report it, but to make sense of it, you have to stack free memory along with cached and buffered memory, if you care about what's actually available for applications to use.

Even that is misleading. It's actually non-trivial to find out exactly how much "freeable" memory one has on a Linux system these days, as not all of the cached memory is truly freeable.

Even then there are some wrinkles; the anon shared memory used by e.g. the Oracle SGA will show up as cached memory, but evicting it is a no-no.
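A rough way to account for that wrinkle, assuming a kernel whose `/proc/meminfo` has a `Shmem` field, is to subtract shared memory from the cached figure before stacking. This is a sketch of an estimate, not an exact answer: shared memory backed by huge pages won't appear in `Shmem`, and other unevictable cache isn't covered. The function name and the sample numbers are invented for illustration.

```python
def estimate_freeable_kb(meminfo):
    """Estimate reclaimable memory in kB from parsed /proc/meminfo values.

    Treats Cached minus Shmem as evictable page cache. Falls back to
    Shmem = 0 when the field is missing, which overestimates.
    """
    shmem = meminfo.get("Shmem", 0)
    evictable_cache = max(meminfo["Cached"] - shmem, 0)
    return meminfo["MemFree"] + meminfo["Buffers"] + evictable_cache


# Made-up numbers: a host where most of "Cached" is really shared memory.
example = {"MemFree": 500000, "Buffers": 100000,
           "Cached": 6000000, "Shmem": 5000000}
naive = example["MemFree"] + example["Buffers"] + example["Cached"]
adjusted = estimate_freeable_kb(example)
```

With these numbers the naive sum is 6600000 kB while the adjusted estimate is 1600000 kB, which shows how far off the stacked graph can be on a database host.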
Yes, I can't find the socket backlog anywhere in Linux. FreeBSD exposes it via kqueue (http://www.freebsd.org/cgi/man.cgi?query=kqueue) through the data item in EVFILT_READ.
With FreeBSD it's even easier; you can use "netstat -L".
Swap rate still looks like the wrong metric. It'd be better to have the rate of swap lookups, excluding all writes.
Swap-in rate, to be more specific. Swap-outs aren't incredibly worrisome.
That's backwards: things like mmap() will generate page-in activity during normal operation. Page-outs mean that the operating system had to evict something to satisfy other memory requests, which is what you really want to know.
swapouts and pageouts aren't identical in Linux, and are instrumented separately (pswpout and pgpgout, respectively; see /proc/vmstat). mmap() and other page-ins won't be counted under the swap statistics.

A pageout might suggest memory pressure, but not nearly as much as a swapout does. (pgmajfault is a better indicator.) Writing dirty pages is just something the kernel does even when there's no memory pressure at all. Also, unfortunately you can't use pgpgout for anything useful as ordinary file writes are counted there.
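For anyone wanting to graph these as rates, a minimal Python sketch: sample `/proc/vmstat` twice and divide the counter deltas by the interval. The counter names `pswpin`, `pswpout`, and `pgmajfault` are the ones mentioned above; the function names `parse_vmstat` and `per_second_rates` are hypothetical.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style text ("name value" per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def per_second_rates(before, after, interval_s,
                     names=("pswpin", "pswpout", "pgmajfault")):
    """Turn two cumulative counter samples into per-second rates.

    pgpgout is deliberately not in the defaults because ordinary file
    writes are counted there too.
    """
    return {n: (after[n] - before[n]) / interval_s for n in names}
```

In practice you would read `/proc/vmstat`, sleep for the sampling interval, read it again, and pass both parsed samples in. The counters are cumulative since boot, so a single reading on its own says little.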