If I understand correctly, valgrind (cachegrind) reports L1/L2 cache misses based on a simulated CPU/cache model. On Linux, you can easily count real cache events using the very powerful perf suite. There is an overwhelming number of events you can instrument (use perf-list(1) to show them), but a simple example could look like this:

$ perf stat -d -- sh -c 'find ~ -type f -print | wc -l'
^Csh: Interrupt
Performance counter stats for 'sh -c find ~ -type f -print | wc -l':
47,91 msec task-clock # 0,020 CPUs utilized
599 context-switches # 12,502 K/sec
81 cpu-migrations # 1,691 K/sec
569 page-faults # 11,876 K/sec
185.814.947 cycles # 3,878 GHz (28,71%)
105.650.405 instructions # 0,57 insn per cycle (46,15%)
22.991.322 branches # 479,863 M/sec (46,72%)
643.767 branch-misses # 2,80% of all branches (46,14%)
26.010.223 L1-dcache-loads # 542,871 M/sec (36,80%)
2.449.173 L1-dcache-load-misses # 9,42% of all L1-dcache accesses (29,62%)
517.052 LLC-loads # 10,792 M/sec (22,53%)
133.152 LLC-load-misses # 25,75% of all LL-cache accesses (16,02%)
2,403975646 seconds time elapsed
0,005972000 seconds user
0,046268000 seconds sys
Ignore the command; it's just a placeholder to get meaningful values. The -d flag adds basic cache events; by adding another -d you also get load and load-miss events for the dTLB, iTLB and L1i cache.

But as mentioned, you can instrument any event supported by your system, including very obscure ones such as uops_executed.cycles_ge_2_uops_exec (Cycles where at least 2 uops were executed per-thread) or frontend_retired.latency_ge_2_bubbles_ge_2 (Retired instructions that are fetched after an interval where the front-end had at least 2 bubble-slots for a period of 2 cycles which was not interrupted by a back-end stall).

You can also record data using perf-record(1) and inspect it using perf-report(1) or, my personal favorite, the Hotspot tool (https://github.com/KDAB/hotspot).

Sorry for hijacking the discussion a little, but I think perf is an awesome little tool and not as widely known as it should be. IMO, when using it as a profiler (perf-record), it is vastly superior to any language-specific built-in profiler. Unfortunately, some languages (such as Python or Haskell) are not a good fit for profiling using perf instrumentation, as their stack frame model does not quite map to the C model.
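
For anyone who wants to try this, here is a rough sketch of how the pieces above fit together as a few shell helpers. The function names (need_perf, cache_stats, count_events, profile) and ./my_program are made up for illustration. The perf subcommands and flags (stat -d, stat -e, record -g, report) are the standard ones, but which events are available depends on your CPU and kernel, so treat the event name in the usage comment as an example only.

```shell
#!/bin/sh
# Hypothetical helper functions wrapping the perf commands discussed above.

# Fail early with a clear message when perf is not installed.
need_perf() {
    command -v perf >/dev/null 2>&1 || {
        echo "perf not found in PATH" >&2
        return 127
    }
}

# Counter summary including the extra dTLB/iTLB/L1i events (-d given twice).
cache_stats() {
    need_perf || return
    perf stat -d -d -- "$@"
}

# Count specific events; the first argument is a comma-separated event list
# (names as shown by `perf list`), the rest is the command to run.
count_events() {
    need_perf || return
    events=$1
    shift
    perf stat -e "$events" -- "$@"
}

# Sample the command with call graphs; writes perf.data to the current directory.
profile() {
    need_perf || return
    perf record -g -- "$@"
}

# Usage (nothing is run when this file is sourced):
#   cache_stats sh -c 'find ~ -type f -print | wc -l'
#   count_events uops_executed.cycles_ge_2_uops_exec ./my_program
#   profile ./my_program
#   perf report        # browse the recorded perf.data in the terminal
```

After profile has written perf.data, you can either run perf report in the same directory or open the file in Hotspot. Depending on how your system is configured, some events may need elevated privileges.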