Also worth mentioning here is perf[1], which is great for low-overhead profiling. Using autofdo[2], perf profiles can also be converted into profiles compatible with GCC and LLVM PGO, so you can build optimized binaries based on production runs. In my use case, the overhead of instrumentation was too high to use regular instrumented profiling on production workloads.
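For anyone who has not seen the perf-to-PGO flow, here is a rough sketch of the LLVM variant as I understand it. It only builds the command lines and does not run anything. The binary name, source file, workload arguments, and output file names are made up for illustration, and `perf record -b` assumes a CPU with branch-record (LBR-style) support, so treat this as a starting point rather than a recipe.

```python
import shlex


def autofdo_commands(binary, source, workload_args):
    """Return the three command lines of a perf -> AutoFDO -> clang flow.

    All file names here are hypothetical placeholders.
    """
    perf_data = "perf.data"   # raw perf samples
    profile = "code.prof"     # converted sample profile for clang
    return [
        # 1. Sample the running binary with branch records (-b).
        ["perf", "record", "-b", "-o", perf_data, "--", binary, *workload_args],
        # 2. Convert perf.data with AutoFDO's create_llvm_prof tool.
        ["create_llvm_prof", f"--binary={binary}",
         f"--profile={perf_data}", f"--out={profile}"],
        # 3. Rebuild with the sample profile; line tables help map samples to source.
        ["clang", "-O2", "-gline-tables-only",
         f"-fprofile-sample-use={profile}", source, "-o", f"{binary}.opt"],
    ]


if __name__ == "__main__":
    for cmd in autofdo_commands("./app", "app.c", ["--input", "data.bin"]):
        print(shlex.join(cmd))
```

The GCC side uses a different converter and flag (`create_gcov` and `-fauto-profile`), which is not shown here.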
perf and its ilk are obviously useful, but you need to be aware of several cans of worms, particularly when sampling hardware counters. These include the timing mechanism used for sampling, how well particular counters are documented and how useful they intrinsically are, and the issues with multiplexing when you request more counters than the hardware can run simultaneously. For multiplexing see, for instance, https://www.research.manchester.ac.uk/portal/files/59933625/...
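To make the multiplexing issue concrete: when events are time-shared, the kernel reports how long each event was enabled and how long it was actually counting, and tools typically scale the raw count by that ratio. A minimal sketch of that arithmetic follows; the function and parameter names are mine, and the numbers in the usage lines are invented.

```python
def scale_multiplexed_count(raw_count, time_enabled_ns, time_running_ns):
    """Extrapolate a raw count to the full enabled period.

    This assumes the event rate while the counter was scheduled is
    representative of the whole run, which is exactly what can go wrong.
    """
    if time_running_ns == 0:
        raise ValueError("event was never scheduled; no estimate possible")
    return raw_count * time_enabled_ns / time_running_ns


def running_fraction(time_enabled_ns, time_running_ns):
    """Fraction of the enabled time the event was actually counting."""
    if time_enabled_ns == 0:
        return 0.0
    return time_running_ns / time_enabled_ns


if __name__ == "__main__":
    # Invented numbers: the event was counting for a quarter of the run.
    print(scale_multiplexed_count(1000, 200, 50))  # 4000.0
    print(running_fraction(200, 50))               # 0.25
```

The scaled value is an estimate, not a measurement: if the workload has phases, a counter that happened to be scheduled during a hot or cold phase will extrapolate badly, and the lower the running fraction, the less you should trust the number.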