I wrote an extremely fast hot-polled pipe in Rust for QEMU instrumentation. I'm sure there's room to improve, but it's effectively bottlenecking on the uarch. https://github.com/MarginResearch/cannoli/blob/main/mempipe/...

This was specifically designed for one producer (the QEMU processor) and many consumers (processing the trace of instructions and memory accesses). I can't remember what specific tunings I did to help with this specific model. Data structures like this can always be tuned slightly based on your workload, by changing structure shapes, shared cache lines, etc.

With about 3-4 consumers, QEMU was not really blocking on the reporting of every single instruction executed, which is really cool. This requires a low-noise system; just having a browser open can almost halve these numbers, since there's just more cache coherency bus traffic occurring. If having a browser open doesn't affect your benchmark, it's probably not bottlenecking on the uarch yet ;) https://github.com/MarginResearch/cannoli/blob/main/.assets/...

A little blog on Cannoli itself here: https://margin.re/2022/05/cannoli-the-fast-qemu-tracer/

Ultimately, mempipe is not really a big thing I've talked about, but it's actually what makes cannoli so good and enables the design in the first place.
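To make the one-producer, many-consumer hot-polling idea concrete, here is a minimal in-process sketch. It is not the mempipe code: mempipe works over shared memory between processes, and its actual layout and tuning are not described here. The names `Pipe`, `Padded`, `SLOTS`, `send`, `recv` and `demo` are all hypothetical, and the 64-byte alignment is an assumption about the cache line size.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Ring capacity (hypothetical value, not taken from mempipe).
const SLOTS: usize = 64;

/// Keeps each counter on its own cache line (assumes 64-byte lines) so the
/// producer and the consumers do not false-share.
#[repr(align(64))]
struct Padded<T>(T);

struct Pipe {
    head: Padded<AtomicUsize>,       // messages published by the one producer
    tails: Vec<Padded<AtomicUsize>>, // messages consumed, one counter per consumer
    slots: Vec<AtomicU64>,           // payload ring
}

impl Pipe {
    fn new(consumers: usize) -> Self {
        Pipe {
            head: Padded(AtomicUsize::new(0)),
            tails: (0..consumers).map(|_| Padded(AtomicUsize::new(0))).collect(),
            slots: (0..SLOTS).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    /// Producer side. Must only ever be called from a single thread.
    fn send(&self, value: u64) {
        let head = self.head.0.load(Ordering::Relaxed);
        // Hot-poll until every consumer has released the slot we will overwrite.
        for tail in &self.tails {
            while head - tail.0.load(Ordering::Acquire) >= SLOTS {
                spin_loop();
            }
        }
        self.slots[head % SLOTS].store(value, Ordering::Relaxed);
        // Release publishes the slot write to consumers that Acquire `head`.
        self.head.0.store(head + 1, Ordering::Release);
    }

    /// Consumer side. Every consumer sees every message, in order.
    fn recv(&self, id: usize) -> u64 {
        let tail = self.tails[id].0.load(Ordering::Relaxed);
        // Hot-poll until the producer has published something new.
        while self.head.0.load(Ordering::Acquire) == tail {
            spin_loop();
        }
        let value = self.slots[tail % SLOTS].load(Ordering::Relaxed);
        // Release tells the producer this slot may now be reused.
        self.tails[id].0.store(tail + 1, Ordering::Release);
        value
    }
}

/// Sends 1..=messages to `consumers` threads and returns each thread's sum.
fn demo(consumers: usize, messages: u64) -> Vec<u64> {
    let pipe = Arc::new(Pipe::new(consumers));
    let handles: Vec<_> = (0..consumers)
        .map(|id| {
            let pipe = Arc::clone(&pipe);
            thread::spawn(move || (0..messages).map(|_| pipe.recv(id)).sum::<u64>())
        })
        .collect();
    for i in 1..=messages {
        pipe.send(i);
    }
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling `demo(3, 1000)` runs three consumers against one producer. Neither side ever sleeps or makes a syscall; they just spin on each other's counters, so the cost of a message is mostly the cost of moving cache lines between cores.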
https://raw.githubusercontent.com/MarginResearch/cannoli/mai...
You can see a massive improvement for shared hyperthreads, and on-CPU-socket messages.
In this case, about 350 cycles local core, 700 cycles remote core, and 90 cycles same hyperthread. Divide these by your clock rate (as long as you're on Skylake+) for the time in seconds. E.g., about 87.5 nanoseconds on a 4 GHz processor for local-core IPC.
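The conversion is just a division, sketched below. The helper name `cycles_to_ns` is made up for illustration, and it assumes the cycle counts were measured at the clock rate you pass in.

```rust
/// Convert a cycle count to nanoseconds for a clock rate given in GHz.
/// Hypothetical helper; assumes the cycles were counted at `clock_ghz`.
fn cycles_to_ns(cycles: u64, clock_ghz: f64) -> f64 {
    // 1 GHz is one cycle per nanosecond, so ns = cycles / GHz.
    cycles as f64 / clock_ghz
}
```

For example, `cycles_to_ns(350, 4.0)` gives 87.5 ns for the local-core case, and the same call with 700 or 90 cycles covers the remote-core and same-hyperthread numbers.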