Thanks! Would love to hear more about the counters that you're interested in. We've exposed more in C5 than in previous instance types and we are trying to make more available over time in a safe way.
- General performance analysis. For this, more counters are generally incrementally better.
- Running https://github.com/mozilla/rr. This requires the retired-branch-counter to be available (and accurate - sometimes virtualization messes that up)
The second one I actually care more about, because I've pretty much stopped trying to debug software when rr is not available, too painful ;). Feel free to email me (email is in my profile) for gory details.
For the benefit of anyone reading this, KVM and VMware virtualization generally work. Xen has problems because of a stupid Xen workaround for a stupid Intel hardware bug from a decade ago. I can provide more details about that via email (in my profile) if desired.
Seconding paulie_a. We're running a Xen stack right now and I haven't heard of this. We've worked around a few nasty bugs with Xen and Linux doms already, but I'm wondering if we have the problem you're referring to and don't even know it.
One of the things the performance monitoring unit (PMU) is capable of doing is triggering an interrupt (the PMI) when a counter overflows. When combined with the ability to write to the counters, this lets you program the PMU to interrupt after a certain number of counted events. Nehalem supposedly had a bug where the PMI fires not on overflow but instead whenever the counter is zero. Xen added a workaround to set the value to 1 whenever it would instead be 0. Later this was observed on microarchitectures other than Nehalem, and Xen broadened the workaround to run on every x86 CPU. Intel never provided any help in narrowing it down, and there don't seem to be any official errata for this behavior either.
This behavior is ok for statistically profiling frequent events but if you depend on exact counts (as rr does) or are profiling infrequent events it can mess up your day.
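To make the off-by-one concrete, here is a rough toy model in Python. It is only a sketch of the idea, not Xen or hardware code: the 48-bit counter width, the `simulate` function, and the exact point at which the workaround is applied (right after the counter wraps to 0) are all assumptions made for illustration.

```python
COUNTER_BITS = 48                  # assumed counter width for this toy model
MASK = (1 << COUNTER_BITS) - 1


def simulate(interrupt_after, total_events, zero_workaround):
    """Toy model of a writable PMU counter (hypothetical, not Xen code).

    Returns (event number at which the PMI fired, or None if it never fired,
             number of events as read back from the counter).
    """
    # Preload the counter with -N so that it overflows after N events.
    start = (-interrupt_after) & MASK
    value = start
    pmi_at = None
    for n in range(1, total_events + 1):
        value = (value + 1) & MASK
        if value == 0:             # counter wrapped: overflow, PMI fires
            if pmi_at is None:
                pmi_at = n
            if zero_workaround:
                value = 1          # modelled workaround: never leave it at 0
    # What a reader of the counter would compute as "events seen".
    observed = (value - start) & MASK
    return pmi_at, observed


print(simulate(5, 8, zero_workaround=False))   # (5, 8)
print(simulate(5, 8, zero_workaround=True))    # (5, 9)
```

In this model the interrupt still fires at the right event, but the count read back afterwards is one too high. That is noise for a statistical profiler sampling millions of events, and fatal for anything that needs the exact number.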
rr works fine on multithreaded (and multiprocess) applications. It does emulate a single core machine though, so depending on your workload and how much parallelism your application actually has it might be painful.