Ask HN: What's your best C/C++ profiling tool, hints and best practice?
21 points by henrikm85 3948 days ago
I frequently need to profile C++ on Linux. I have used perf a lot but do not have VTune handy. I have dabbled with the poor man's profiler, but that gets trickier with lots and lots of threads. What are your favorite outside-the-box approaches? How do you track down contention or IO wait issues?
6 comments

HPCToolkit and TAU are good options for profiling C++ applications. They come from HPC so they are intended for use with parallel and highly concurrent applications.

http://hpctoolkit.org/ https://www.cs.uoregon.edu/research/tau/home.php

Threadspotter (paratools.com) and maqao.org might be of interest, at least for x86_64 GNU/Linux but I wouldn't know about C++ specifics.

[TAU is doubtless a good bet, but for what it's worth for general interest, the other common systems for HPC are openspeedshop.org, cube/scalasca (scalasca.org), and extrae/paraver (bsc.es). A good comparison of them all would be useful, but I've not found one.]

Any tips on profiling Windows device drivers? I've tried xperf, but it doesn't let you change the sampling frequency (or I haven't seen how), and it can only do a system-wide profile.

I've also tried VTune, but it doesn't support stack tracing (or things like LBR) for system-wide profiling, and it doesn't have a specific option for sampling drivers. You can attach to the System process, but then you miss a lot of your driver code that runs in other contexts.

I keep thinking about implementing my own sampling profiler (using LBR for stack tracing, and hardware performance events, like Linux's oprofile or FreeBSD's hwpmc), but I can't see how I could profile only my driver, and not the whole system, without hooking the Windows scheduler. I guess I will just profile the whole system and check whether the program counter is inside my module.
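For what it's worth, the "check if the program counter is inside my module" step can be done after the fact on whatever samples the system-wide profiler gives you. A minimal sketch, assuming you already have the sampled program counters as integers and know the module's load address and size; `ModuleRange`, `in_module` and `count_module_samples` are names made up for this example, not part of any Windows or profiler API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of where a module (e.g. a driver image) is loaded.
struct ModuleRange {
    std::uintptr_t base;  // load address of the module
    std::size_t size;     // size of the mapped image in bytes
};

// True if pc lies in [base, base + size).
// The subtraction form avoids overflow if base + size would wrap around.
bool in_module(const ModuleRange& m, std::uintptr_t pc) {
    return pc >= m.base && (pc - m.base) < m.size;
}

// Count how many system-wide samples landed inside the module.
std::size_t count_module_samples(const ModuleRange& m,
                                 const std::vector<std::uintptr_t>& pcs) {
    assert(m.size > 0 && "module range must be non-empty");
    return static_cast<std::size_t>(
        std::count_if(pcs.begin(), pcs.end(),
                      [&m](std::uintptr_t pc) { return in_module(m, pc); }));
}
```

Dividing the count by the total number of samples gives a rough share of system time spent in the module, which is probably enough to decide whether the driver is worth a closer look.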

Before you start, determine what you are attempting to optimize. Throughput or latency? Improving averages, or reducing how often below-acceptable performance occurs?
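To make the averages-versus-bad-cases distinction concrete, here is a small sketch that summarizes the same latency samples three ways: mean, 99th percentile, and the fraction of requests over a budget. The names (`LatencySummary`, `summarize`) and the nearest-rank percentile definition are my own choices for illustration, not taken from any particular tool:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Three views of the same measurements (units are whatever the caller used).
struct LatencySummary {
    double mean;                  // what "improving averages" moves
    double p99;                   // tail latency, nearest-rank definition
    double fraction_over_budget;  // how often performance is below acceptable
};

// Takes the samples by value because it sorts them.
LatencySummary summarize(std::vector<double> samples, double budget) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());

    const std::size_t n = samples.size();
    const double sum = std::accumulate(samples.begin(), samples.end(), 0.0);

    // Nearest-rank percentile: 1-based rank = ceil(0.99 * n).
    const std::size_t rank = (99 * n + 99) / 100;

    const auto over = std::count_if(samples.begin(), samples.end(),
                                    [budget](double s) { return s > budget; });

    return {sum / static_cast<double>(n),
            samples[rank - 1],
            static_cast<double>(over) / static_cast<double>(n)};
}
```

With 98 samples at 1 ms and 2 samples at 101 ms, the mean is 3 ms while the 99th percentile is 101 ms, so an optimization that only shaves the common case barely touches the number users actually complain about.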

Write end-to-end tests that exercise the application as closely as possible to what a user would do. Then use a profiler with an API, so you can start collecting data only once test/app setup has completed (to avoid extraneous noise and misleading data). I like gperftools combined with KCachegrind as a GUI. I used it very successfully in MyPaint, for instance: http://www.jonnor.com/2012/11/improved-drawing-performance-i...
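A rough sketch of the "start after setup" idea with the gperftools CPU profiler, which exposes `ProfilerStart(const char*)` and `ProfilerStop()` in `<gperftools/profiler.h>`. Everything else here is assumed for illustration: the `HAVE_GPERFTOOLS` macro, the `ProfileScope` and `run_profiled` names, and treating a nonzero return from `ProfilerStart` as success (that is my reading of the header, so check your version). Without the macro defined, the code compiles to a no-op wrapper:

```cpp
#include <cassert>
#ifdef HAVE_GPERFTOOLS
#include <gperftools/profiler.h>
#endif

// Hypothetical RAII wrapper: profiling runs only while this object is alive.
class ProfileScope {
public:
    explicit ProfileScope(const char* path) {
        assert(path != nullptr);
#ifdef HAVE_GPERFTOOLS
        // Assumption: nonzero return means the profiler started.
        active_ = ProfilerStart(path) != 0;
#endif
    }
    ~ProfileScope() {
#ifdef HAVE_GPERFTOOLS
        if (active_) ProfilerStop();  // flushes the profile to the file
#endif
    }
    ProfileScope(const ProfileScope&) = delete;
    ProfileScope& operator=(const ProfileScope&) = delete;

    bool active() const { return active_; }

private:
    bool active_ = false;
};

// Run setup outside the profiled region, then profile only the real work.
template <typename Setup, typename Work>
void run_profiled(Setup setup, Work work, const char* path) {
    setup();                   // not profiled: loading, warm-up, fixtures
    ProfileScope scope(path);  // profiling starts here
    work();                    // profiled: the scenario under test
}                              // profiling stops when scope is destroyed
```

To actually collect data you would build with `-DHAVE_GPERFTOOLS` and link against `-lprofiler`, then call something like `run_profiled(load_document, replay_user_strokes, "app.prof")`, where both callables stand in for your own test code.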

Your question is a bit all over the map. Are you interested in reducing CPU usage, reducing time spent in locks, or talking to the kernel more efficiently? Are you targeting throughput or latency?

That said, http://www.brendangregg.com/flamegraphs.html is a nice introduction to a site that has lots of material.

valgrind's got some good features along these lines: http://valgrind.org/info/tools.html
Valgrind is nice. However, especially with multi-threaded programs, the virtualized execution diverges quite a bit from a run outside the Valgrind VM, so I am not a huge fan.
Valgrind is good.

Gprof is good too.

Very interested in this too
me too