| bpftrace was instrumental in helping me diagnose a latency spike on a hot path. I call into a vendor library that eventually calls into Solarflare's TCPDirect API to send packets to the network. This call usually stays under 5us and has very little variance, but I was seeing the occasional 30ms or 60ms outlier. I attached probes to track the latency of each call down the stack and it became quite easy to pin down the specific function. Eventually tracking it down to a single assignment to a variable that was mmapped to a file, and getting spikes due to page cache writeback. After figuring it out I found these two articles that go into more detail on the problem. https://rigtorp.se/virtual-memory/ https://tungdam.medium.com/our-lessons-on-linux-writeback-do... Here's one of the probes producing a histogram of the latency of the call to the TCPDirect function, which was surprising in how simple it was. sudo bpftrace -p "$PID" -e '
// —— TCPDirect ——
uprobe:/lib/x86_64-linux-gnu/libonload_zf.so.1:zft_send_single {
@zf_start[tid] = nsecs;
}
uretprobe:/lib/x86_64-linux-gnu/libonload_zf.so.1:zft_send_single /@zf_start[tid]/ {
$lat = nsecs - @zf_start[tid];
@h_zft = hist($lat / 1000);
delete(@zf_start[tid]);
}
'
|