by knweiss
3697 days ago
Brendan, did you read this awesome blog post (which mentions your work btw) about Google's tracing framework which may explain the kind of problems they want to solve? http://danluu.com/perf-tracing/ "Sampling profilers, the most common performance debugging tool, are notoriously bad at debugging problems caused by tail latency because they aggregate events into averages. But tail latency is, by definition, not average."
As for tail latency: one way is to dump every function event; a lower-cost way is to summarize latency in-kernel as a histogram. For both, you're still tracing every function, and you care about overhead down to 100 cycles, etc.
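To make the histogram idea concrete, here is a rough user-space sketch of power-of-two latency bucketing, the kind of summary an in-kernel tracer could keep instead of emitting every event. The function names and the sample data are made up for illustration; a real tracer would do this in kernel context, not in Python.

```python
from collections import Counter

def log2_bucket(latency_ns: int) -> int:
    """Return bucket index b such that the value falls in [2**b, 2**(b+1))."""
    # Values below 1 are clamped into bucket 0.
    return max(latency_ns, 1).bit_length() - 1

def summarize(latencies_ns):
    """Count latencies per power-of-two bucket instead of storing each one."""
    hist = Counter()
    for ns in latencies_ns:
        hist[log2_bucket(ns)] += 1
    return hist

def print_hist(hist):
    """Print one line per non-empty bucket: range in ns and event count."""
    for b in sorted(hist):
        print(f"{2**b:>10} -> {2**(b + 1) - 1:<10} : {hist[b]}")

# Made-up sample: 997 fast calls (1 us) and 3 slow outliers (5 ms).
samples = [1_000] * 997 + [5_000_000] * 3
hist = summarize(samples)
print_hist(hist)
```

The three outliers land in their own bucket far from the rest, so the tail stays visible, while only one counter per bucket needs to be kept rather than one record per event.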
But think about what causes the tails in the first place. Lock contention? Trace that. Resource I/O? Trace that. If it costs 1 us to trace disk I/O, it's usually not a problem. I like to time scheduler switch events with stack traces -- a catch-all (but a bit more expensive). Of course, these approaches require kernel instrumentation. :)
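As a sketch of the scheduler-switch approach: if each switch event carries a timestamp, the thread leaving the CPU, the thread coming on, and the leaving thread's stack, then blocked time can be summed per stack. The event tuple layout, the `off_cpu_by_stack` name, and the stacks below are all hypothetical, chosen only to show the bookkeeping.

```python
from collections import defaultdict

def off_cpu_by_stack(events):
    """Sum off-CPU time per blocking stack.

    events: iterable of (ts_ns, prev_tid, next_tid, prev_stack), sorted by
    time. This tuple layout is an assumption made for this sketch.
    """
    went_off = {}              # tid -> (timestamp it left the CPU, its stack)
    totals = defaultdict(int)  # stack -> total off-CPU nanoseconds
    for ts, prev_tid, next_tid, prev_stack in events:
        # The outgoing thread starts its off-CPU interval now.
        went_off[prev_tid] = (ts, prev_stack)
        # The incoming thread ends its off-CPU interval, if we saw it start.
        if next_tid in went_off:
            off_ts, stack = went_off.pop(next_tid)
            totals[stack] += ts - off_ts
    return dict(totals)

# Made-up stacks and timestamps for two threads taking turns on one CPU.
io_stack = ("read", "vfs_read", "io_schedule")
lock_stack = ("lock", "mutex_lock", "schedule")
events = [
    (1_000, 1, 2, io_stack),
    (4_000, 2, 1, lock_stack),
    (5_000, 1, 2, io_stack),
    (9_000, 2, 1, lock_stack),
]
result = off_cpu_by_stack(events)
for stack, ns in sorted(result.items(), key=lambda kv: -kv[1]):
    print(ns, ";".join(stack))
```

Sorting by total time puts the stack responsible for the most blocked time first, which is what makes this a catch-all: it doesn't matter whether the wait was a lock, disk I/O, or something else.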