by Shish2k 1535 days ago
> Counting instructions is very accurate and roughly approximates power usage

I’ve always assumed this to be true, but I see a lot of benchmarking tools / libraries measuring wall-clock time or iterations-per-second or something like that. I’ve never seen a benchmark tool which counts CPU instructions. Am I being blind or is there some other reason that I’m not seeing them? :S

4 comments

At the end of the day most people care about wall clock time. It's a real physical value that's easy to understand and easy to compare between systems. Plus, if two functions execute, say, 1 billion instructions each, but one spends extra time stalled waiting on IO or data fetches from RAM, you definitely want to account for that in normal benchmarking.

Instruction counting is more of a specialized tool but I like to use it whenever I can because it has low variance and makes comparing changes a lot easier. Compare how bumpy these graphs are for instruction count (first link) and wall clock time (second link):

https://perf.rust-lang.org/

https://perf.rust-lang.org/?start=&end=&kind=raw&stat=wall-t...

Counting instructions does not give information about time spent in syscalls/doing IO, which limits its use to CPU-bound software.
Instruction counts correlate with energy use but not with performance. If you're benchmarking performance you should use wall clock time.
Counting instructions properly is hard and also results in a good amount of overhead if you don't use a bunch of tricks or a kernel module.
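To make the syscall/IO point concrete, here is a rough analogy rather than real instruction counting: Python's `time.process_time()` only advances while the process is on the CPU, so like an instruction counter it is blind to time spent blocked. The helper name `measure` is made up for this sketch.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent running fn()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


if __name__ == "__main__":
    # A sleep stands in for blocking IO: wall time grows, CPU time barely moves.
    wall, cpu = measure(lambda: time.sleep(0.2))
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

A CPU-side metric reports almost nothing for the sleeping call, which is why it only tells the full story for CPU-bound code.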

You also can't really count instructions in the cloud.

Counting (userspace) instructions is relatively easy regardless of language with perf stat, though it does require the kernel module. Generally speaking it should just work if perf is installed through the package manager for your distribution.

edit: valgrind's callgrind utility can also produce exact instruction execution counts for a given block of code
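For anyone who wants the perf stat number from a script, a minimal sketch might look like the one below. It assumes Linux with `perf` on the PATH and permission to read hardware counters. `parse_perf_csv` and `count_user_instructions` are names invented here, and the parser assumes the `-x` CSV layout described in the perf-stat man page (counter value first, event name third).

```python
import subprocess


def parse_perf_csv(stderr_text, event="instructions"):
    """Pull the counter value for `event` out of `perf stat -x,` output."""
    for line in stderr_text.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and event in fields[2]:
            # The value field holds text such as "<not supported>" when
            # the counter is unavailable (common inside VMs).
            return int(fields[0]) if fields[0].isdigit() else None
    return None


def count_user_instructions(cmd):
    """Run cmd (a list of strings) under perf stat; return the userspace count."""
    proc = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", "instructions:u", "--", *cmd],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,  # perf stat writes its report to stderr
        text=True,
    )
    return parse_perf_csv(proc.stderr)


if __name__ == "__main__":
    print(count_user_instructions(["ls", "/"]))
```

The count covers the whole child process, startup included, so it is best used to compare two builds of the same program rather than to time one block of code.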

Callgrind can give you instruction counts, yes. It doesn't simulate any microarchitecture other than caches, which means it's only useful for comparing with itself.

Perf stat is very, very high overhead. The perf API is available and can be tuned a bit more nicely but it's mostly a horrible mess. It uses bitfields too, which makes it somewhat hard to get to from other languages unless you trust the shifts and masks.
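As an illustration of what "trusting the shifts and masks" means, here is a hedged sketch of calling `perf_event_open` from Python with ctypes. Everything hard-coded is an assumption to check against `linux/perf_event.h` on your machine: the syscall number 298 (x86_64 only), the ioctl request values, the 64-byte `PERF_ATTR_SIZE_VER0` layout, and the bit positions, which assume the little-endian bitfield order GCC uses. The function names are invented for this example.

```python
import ctypes
import fcntl
import os
import struct

PERF_TYPE_HARDWARE = 0
PERF_COUNT_HW_INSTRUCTIONS = 1
PERF_ATTR_SIZE_VER0 = 64
PERF_EVENT_IOC_ENABLE = 0x2400
PERF_EVENT_IOC_DISABLE = 0x2401
PERF_EVENT_IOC_RESET = 0x2403

# Assumed bit positions of the leading perf_event_attr bitfield members.
FLAG_BITS = {"disabled": 0, "inherit": 1, "pinned": 2, "exclusive": 3,
             "exclude_user": 4, "exclude_kernel": 5, "exclude_hv": 6}


def pack_flags(**flags):
    """Fold named boolean flags into the single 64-bit bitfield word."""
    value = 0
    for name, enabled in flags.items():
        if enabled:
            value |= 1 << FLAG_BITS[name]
    return value


def build_attr(flags):
    """Pack a minimal perf_event_attr: type, size, config, sample_period,
    sample_type, read_format, flags, then zero padding up to 64 bytes."""
    return struct.pack("=IIQQQQQ16x", PERF_TYPE_HARDWARE, PERF_ATTR_SIZE_VER0,
                       PERF_COUNT_HW_INSTRUCTIONS, 0, 0, 0, flags)


def count_instructions(fn, syscall_nr=298):
    """Count userspace instructions retired by this process while fn() runs."""
    raw = build_attr(pack_flags(disabled=True, exclude_kernel=True,
                                exclude_hv=True))
    attr = (ctypes.c_char * len(raw)).from_buffer_copy(raw)
    libc = ctypes.CDLL(None, use_errno=True)
    # Arguments: attr, pid=0 (this process), cpu=-1 (any), group_fd=-1, flags=0
    fd = libc.syscall(ctypes.c_long(syscall_nr), ctypes.byref(attr),
                      ctypes.c_int(0), ctypes.c_int(-1), ctypes.c_int(-1),
                      ctypes.c_ulong(0))
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    try:
        fcntl.ioctl(fd, PERF_EVENT_IOC_RESET, 0)
        fcntl.ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)
        fn()
        fcntl.ioctl(fd, PERF_EVENT_IOC_DISABLE, 0)
        return struct.unpack("=Q", os.read(fd, 8))[0]
    finally:
        os.close(fd)
```

Calling `count_instructions(lambda: sum(range(1000)))` should return a count that includes interpreter overhead, and it raises `OSError` where counters are not exposed (many cloud VMs, or a restrictive `perf_event_paranoid` setting).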