| HN Mirror

If I understand correctly valgrind (cachegrind) reports L1/L2 cache misses based on a simulated CPU/cache model.

On Linux, you can easily instrument real cache events using the very powerful perf suite. There is an overwhelming number of events you can instrument (use perf-list(1) to show them), but a simple example could look like this:

  $ perf stat -d -- sh -c 'find ~ -type f -print | wc -l'
  ^Csh: Interrupt
   Performance counter stats for 'sh -c find ~ -type f -print | wc -l':
  
               47,91 msec task-clock                #    0,020 CPUs utilized
                 599      context-switches          #   12,502 K/sec
                  81      cpu-migrations            #    1,691 K/sec
                 569      page-faults               #   11,876 K/sec
         185.814.947      cycles                    #    3,878 GHz                      (28,71%)
         105.650.405      instructions              #    0,57  insn per cycle           (46,15%)
          22.991.322      branches                  #  479,863 M/sec                    (46,72%)
             643.767      branch-misses             #    2,80% of all branches          (46,14%)
          26.010.223      L1-dcache-loads           #  542,871 M/sec                    (36,80%)
           2.449.173      L1-dcache-load-misses     #    9,42% of all L1-dcache accesses  (29,62%)
             517.052      LLC-loads                 #   10,792 M/sec                    (22,53%)
             133.152      LLC-load-misses           #   25,75% of all LL-cache accesses  (16,02%)
  
         2,403975646 seconds time elapsed
  
         0,005972000 seconds user
         0,046268000 seconds sys

Ignore the command, it's just a placeholder to get meaningful values. The -d flag adds basic cache events, by adding another -d you also get load and load miss events for the dTLB, iTLB and L1i cache.

But as mentioned, you can instrument any event supported by your system. Including very obscure events such as uops_executed.cycles_ge_2_uops_exec (Cycles where at least 2 uops were executed per-thread) or frontend_retired.latency_ge_2_bubbles_ge_2 (Retired instructions that are fetched after an interval where the front-end had at least 2 bubble-slots for a period of 2 cycles which was not interrupted by a back-end stall).

You can also record data using perf-record(1) and inspect them using perf-report(1) or - my personal favorite - the Hotspot tool (https://github.com/KDAB/hotspot).

Sorry for hijacking the discussion a little, but I think perf is an awesome little tool and not as widely known as it should be. IMO, when using it as a profiler (perf-record), it is vastly superior to any language-specific built-in profiler. Unfortunately some languages (such as Python or Haskell) are not a good fit for profiling using perf instrumentation as their stack frame model does not quite map to the C model.