by compudj 3696 days ago
Currently, LTTng-UST has slightly higher overhead than 100 cycles per event (roughly 250-300 ns/event on a recent 2.4GHz Intel), which I expect is partly caused by the use of per-CPU buffers rather than per-thread buffers. I have contributed the membarrier system call, and I am currently working on adding restartable sequences and a cpu_id cache to the Linux kernel, so the speed of LTTng-UST can be brought closer to that of a tracer using per-thread buffers. Keeping per-CPU buffers ensures that the tracer uses memory efficiently on workloads that have many more threads than CPU cores.
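
To put rough numbers on the memory argument, here is a back-of-the-envelope sketch. The core count, thread count, and 1 MiB buffer size are made-up values for illustration, not LTTng-UST defaults, and `buffer_memory_bytes` is a hypothetical helper.

```python
MIB = 1024 * 1024


def buffer_memory_bytes(num_buffers: int, buffer_size_bytes: int) -> int:
    """Total memory reserved when every buffer has the same size."""
    return num_buffers * buffer_size_bytes


# Hypothetical workload: far more threads than cores.
cores = 8
threads = 1000
buffer_size = 1 * MIB  # assumed size, not an LTTng-UST default

per_cpu = buffer_memory_bytes(cores, buffer_size)
per_thread = buffer_memory_bytes(threads, buffer_size)

print(per_cpu // MIB, "MiB with per-CPU buffers")        # 8 MiB
print(per_thread // MIB, "MiB with per-thread buffers")  # 1000 MiB
```

With these assumed numbers, per-thread buffers reserve 125 times more memory, and the gap grows with the thread count.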

I also notice that filtering out all function entry/exit pairs that take less than 5 microseconds probably helps reach those performance numbers. This kind of approach, although very specific to function tracing, seems worthwhile, and could eventually be introduced in LTTng-UST.
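
A minimal sketch of what duration-based filtering could look like. This is not X-Ray's or LTTng-UST's actual implementation; `FilteringTracer` and its methods are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```python
THRESHOLD_NS = 5_000  # 5 microseconds, the threshold mentioned above


class FilteringTracer:
    """Keeps an entry/exit pair only if the call lasted long enough."""

    def __init__(self, threshold_ns: int = THRESHOLD_NS):
        self.threshold_ns = threshold_ns
        self._pending = []  # stack of (name, entry_ts_ns) for open calls
        self.events = []    # kept (name, entry_ts_ns, exit_ts_ns) records

    def on_entry(self, name: str, ts_ns: int) -> None:
        self._pending.append((name, ts_ns))

    def on_exit(self, name: str, ts_ns: int) -> None:
        entry_name, entry_ts = self._pending.pop()
        if entry_name != name:
            raise ValueError("unbalanced entry/exit for " + name)
        # Short calls are dropped entirely, so they never reach the buffer.
        if ts_ns - entry_ts >= self.threshold_ns:
            self.events.append((name, entry_ts, ts_ns))


tracer = FilteringTracer()
tracer.on_entry("slow", 0)
tracer.on_entry("fast", 1_000)
tracer.on_exit("fast", 3_000)   # 2 us: filtered out
tracer.on_exit("slow", 9_000)   # 9 us: kept
print(tracer.events)
```

The trade-off is that the entry record has to be held back until the matching exit is seen, which is why this only fits function tracing and not general event tracing.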

Another interesting aspect is that X-Ray seems to directly use the CPU cycle counter. That's fine when the architecture has a reliable TSC source, but LTTng-UST uses the CLOCK_MONOTONIC vDSO to ensure that we properly fall back on other clock sources (e.g. HPET) whenever the system does not have a reliable TSC source. The extra function calls and seqlock may account for a few extra cycles of difference between X-Ray and LTTng-UST.
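
For readers unfamiliar with seqlocks, here is a toy single-threaded sketch of the read-retry protocol. It only illustrates the idea and is not the kernel's vDSO code: a real seqlock also needs memory barriers, which this Python version omits, and `SeqlockValue` is a hypothetical name.

```python
class SeqlockValue:
    """Toy seqlock: even sequence = stable, odd = write in progress."""

    def __init__(self, value):
        self.seq = 0
        self.value = value

    def write(self, value):
        self.seq += 1        # odd: readers that start now must retry
        self.value = value
        self.seq += 1        # even again: the update is published

    def read(self, max_retries: int = 1000):
        for _ in range(max_retries):
            start = self.seq
            if start & 1:    # a writer is in the middle of an update
                continue
            snapshot = self.value
            if self.seq == start:  # no writer interfered: snapshot is valid
                return snapshot
        raise RuntimeError("seqlock read did not stabilise")


clock_data = SeqlockValue(1)
clock_data.write(5)
print(clock_data.read())
```

The reader never blocks the writer, but it pays for two sequence loads and a possible retry on every read, which is the kind of small extra cost a raw cycle-counter read avoids.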