by Sirened
1522 days ago
Can I ask an office hours type question? I worked on a very similar (if not identical lol) project at a job once upon a time and the biggest problem I had (and one that I never really solved well) was recovering call stacks from trace data. I effectively ended up using DWARF and just simulating execution and keeping a call stack in the decoder. This mostly worked fine for small and simple programs, but I ran into SO MUCH trouble because I found that (at least on my generation of cores) IPT actually overflows and drops packets very frequently if you have too many calls/returns too quickly. This is largely not an issue for C code, but once you start getting into more dynamic languages with fancy features, IPT cannot keep up. Once packets get dropped, the call stack for the entire rest of the thread is ruined, since you have no idea who called/returned in the dropped packets. One option that we had but didn't really chase down due to time was combining IPT with low-frequency stack traces, so that we could both reset every so often and, if needed, work backwards/apply heuristics in order to arrive at that next call stack. How did y'all manage this? Your call stacks look totally correct and I'm very impressed :)
- I imagine the extra memory bandwidth of newer parts doesn't hurt. The example traces were taken on server-class Ice Lake machines. They just don't overflow for our typical workloads.
- We found the specific IPT configuration matters a lot. Turning off return compression makes overflows more likely. We allow varying this in magic-trace via the `-timing-resolution` parameter; more detail is available in the wiki. We don't typically see overflows under the default configuration, even on Broadwell server-class parts.
- Clark spent a week on an Intel NUC (mobile Tiger Lake part) toiling away on decode error recovery. For the most part, the data lost are uninteresting branches, and for any given frame only one of its two endpoints (the call in or the return out) needs to survive the decode error for us to be able to construct the frame.
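To illustrate the "one endpoint is enough" idea from the last bullet, here is a minimal Python sketch. It is not magic-trace's actual decoder (that is written in OCaml and handles far more cases). The event format, `Frame`, and `rebuild_frames` are hypothetical names made up for this example, and the choice to pin a guessed endpoint to the time of the most recent gap is an assumption, not a description of what magic-trace does.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical event format: (kind, timestamp, symbol)
#   ("call", t, callee)         a call into `callee` was decoded
#   ("ret",  t, returning_fn)   a return out of `returning_fn` was decoded
#   ("gap",  t, None)           decode error / overflow: events were lost here
Event = Tuple[str, int, Optional[str]]


@dataclass
class Frame:
    symbol: str
    start: int
    end: int
    inferred: bool  # True if one endpoint was guessed rather than decoded


def rebuild_frames(events: List[Event]) -> List[Frame]:
    stack: List[Tuple[str, int]] = []  # open frames as (symbol, start), innermost last
    frames: List[Frame] = []
    if not events:
        return frames
    last_gap = events[0][1]  # before any gap, fall back to the start of the trace

    for kind, t, symbol in events:
        if kind == "gap":
            last_gap = t
        elif kind == "call":
            stack.append((symbol, t))
        elif kind == "ret":
            if any(name == symbol for name, _ in stack):
                # Frames above the innermost match never saw their return:
                # assume it was lost in the most recent gap.
                while stack[-1][0] != symbol:
                    lost, start = stack.pop()
                    end = last_gap if last_gap > start else t
                    frames.append(Frame(lost, start, end, inferred=True))
                _, start = stack.pop()
                frames.append(Frame(symbol, start, t, inferred=False))
            else:
                # Only the return survived: assume the call was lost in the gap.
                frames.append(Frame(symbol, last_gap, t, inferred=True))

    # Anything still open at the end of the trace is closed at the last timestamp.
    end_of_trace = events[-1][1]
    while stack:
        name, start = stack.pop()
        frames.append(Frame(name, start, end_of_trace, inferred=True))
    return frames
```

In this sketch a gap never invalidates the whole stack. A frame is still emitted whenever either its call or its return was decoded, and the `inferred` flag lets a viewer mark the frames whose boundaries are approximate.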
We also considered the periodic stack sampling approach for error recovery, but did not implement it, since the decode error recovery we built turned out to be robust enough in practice.
We ended up having more trouble with runtimes that mess with the stack pointer directly. (The kernel does this for the retpoline Spectre mitigation! But perf is smart and rewrites that part of the instruction stream into a jump for us.) There's code in magic-trace to special-case OCaml exceptions, for instance, and it's likely similar code is necessary for some other runtimes too (we have an open issue for Go's coroutine switching).
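As a rough illustration of why non-local control flow needs special-casing, here is a hypothetical Python sketch of a decoder-side shadow stack. It assumes the decoder can recognize when an exception handler is installed and when an exception is raised. How magic-trace actually detects these for OCaml is not described here, and `ShadowStack` and its method names are invented for the example.

```python
from typing import List


class ShadowStack:
    """Decoder-side call stack that tolerates exception-style unwinding."""

    def __init__(self) -> None:
        self.frames: List[str] = []          # function names, innermost last
        self.handler_depths: List[int] = []  # stack depth at each handler install

    def call(self, symbol: str) -> None:
        self.frames.append(symbol)

    def ret(self) -> None:
        if self.frames:
            self.frames.pop()

    def push_handler(self) -> None:
        # Remember how deep the stack was when the handler was installed.
        self.handler_depths.append(len(self.frames))

    def pop_handler(self) -> None:
        # The protected region ended normally, so forget the handler.
        if self.handler_depths:
            self.handler_depths.pop()

    def raise_exception(self) -> List[str]:
        # A raise jumps straight to the nearest handler with no return
        # instructions, so every frame above the recorded depth is dropped
        # at once. Returns the dropped frames, outermost first.
        depth = self.handler_depths.pop() if self.handler_depths else 0
        unwound = self.frames[depth:]
        del self.frames[depth:]
        return unwound
```

Without the `raise_exception` step, a decoder that only pairs calls with returns would keep the skipped frames on its stack forever, and everything after the raise would be attributed to the wrong callers.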