Hacker News
by saravana87 3425 days ago
TCP retransmission rate looks like a useful metric for monitoring the health of a service. One way to obtain it is by analyzing service interactions, as mentioned in the blog. Tracing could be another way to get that info. I am curious how code-instrumented monitoring solutions get that information. (PS: I work for Netsil)
2 comments

By default you can only get that per-kernel, from the RetransSegs counter in /proc/net/snmp. BPF may allow something more granular.
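A minimal sketch of reading that host-wide counter, assuming the standard Linux layout where /proc/net/snmp holds a "Tcp:" header line followed by a "Tcp:" value line. The function names are made up for illustration, and the ratio is computed over the interval between two samples rather than since boot:

```python
def parse_tcp_counters(snmp_text):
    """Parse the 'Tcp:' header/value line pair from /proc/net/snmp text."""
    rows = [ln.split() for ln in snmp_text.splitlines() if ln.startswith("Tcp:")]
    if len(rows) < 2:
        raise ValueError("no Tcp: section found")
    header, values = rows[0][1:], rows[1][1:]
    return dict(zip(header, (int(v) for v in values)))


def retransmit_ratio(before, after):
    """RetransSegs / OutSegs over the interval between two counter samples."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return retrans / sent if sent > 0 else 0.0


def read_tcp_counters(path="/proc/net/snmp"):
    """Read the current counters from the running kernel (Linux only)."""
    with open(path) as f:
        return parse_tcp_counters(f.read())
```

Calling read_tcp_counters() twice, some seconds apart, and passing both results to retransmit_ratio() gives a rate for the whole host, not for any one service.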

The other way of approaching it is to look for the additional latency it causes, which you can spot on a per-service basis.

Additional latency could be an indicator, but is there any guarantee that it is caused by retransmissions?
If you look at your latency histogram and see a bump at around 200 ms above normal (which was the default minimum retransmission timeout a few years back anyway), it's probably retransmits.
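A rough sketch of that check, assuming you already have per-request latencies in milliseconds for one service. The function name, the use of the median as "normal", and the 50 ms tolerance are all assumptions for illustration, not something taken from a particular tool:

```python
from statistics import median


def retransmit_bump_fraction(latencies_ms, rto_ms=200.0, tolerance_ms=50.0):
    """Fraction of samples sitting roughly rto_ms above the median latency."""
    if not latencies_ms:
        return 0.0
    base = median(latencies_ms)  # stand-in for "normal" latency
    lo = base + rto_ms - tolerance_ms
    hi = base + rto_ms + tolerance_ms
    hits = sum(1 for x in latencies_ms if lo <= x <= hi)
    return hits / len(latencies_ms)
```

A noticeably non-zero fraction hints at retransmits but does not prove them; anything else that adds about 200 ms would look the same.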
Got it.
You can get retransmits from 'sar' on Linux.
I see. But it looks like that is per host, and there is no way to find it for a particular service running on the host.