by stncls 818 days ago
In many cases, yes. Some single-threaded workloads are very sensitive to, e.g., memory latency. They end up spending most of their time with the CPU waiting for a load that missed the cache to arrive from memory.

Typically, those would be sequential algorithms with large memory needs and very random (think: hash table) memory accesses.

Examples: SAT solvers, and anything relying on sparse linear algebra.
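
To make the "very random memory accesses" point concrete, here is a rough pointer-chasing sketch. It assumes nothing beyond the Python standard library; the function names are made up for illustration. Each load depends on the previous one, so the CPU cannot overlap the misses. In CPython the interpreter overhead hides much of the gap, so treat the timings as indicative only; the same loop in C shows the effect far more clearly.

```python
import random
import time
from array import array


def single_cycle_permutation(n, seed=0):
    """Sattolo's algorithm: a random permutation that forms one n-cycle."""
    rng = random.Random(seed)
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i is what guarantees a single cycle
        perm[i], perm[j] = perm[j], perm[i]
    return array("q", perm)  # packed 64-bit ints, not a list of objects


def chase(next_index, steps):
    """Follow next_index[i] repeatedly; every load depends on the last one."""
    i = 0
    for _ in range(steps):
        i = next_index[i]
    return i


def time_chase(next_index, steps):
    start = time.perf_counter()
    chase(next_index, steps)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Assumption: pick n so the array is larger than your last-level cache.
    n = 1 << 20
    sequential = array("q", [(i + 1) % n for i in range(n)])
    scattered = single_cycle_permutation(n)
    print("sequential walk:", time_chase(sequential, n))
    print("scattered walk: ", time_chase(scattered, n))
```

Both walks execute exactly the same instructions and touch every element once; only the order of the addresses differs, so any gap between the two timings comes from the memory hierarchy.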

Obscure personal conspiracy theory: The CPU vendors, notably Intel, deliberately avoid adding the telemetry that would make it trivial for the OS to report a % spent in memory wait.

Users might realize how many of their cores and cycles are being effectively wasted by limits of the memory / cache hierarchy, and stop thinking of their workloads as “CPU bound”.

Armv8.4 onwards has exactly this (https://docs.kernel.org/arch/arm64/amu.html). It counts the number of (active) cycles where instructions can't be dispatched while waiting for data. The percentage of such stalled cycles can be very high, so there are lots of improvements to be found with faster memory (both latency and throughput).
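
For illustration, the arithmetic on top of such counters is just a ratio of deltas. The sketch below assumes you already have two snapshots of a core-cycles counter and a memory-stall-cycles counter from somewhere; the `CounterSnapshot` type and its field names are hypothetical, and how you read the counters is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSnapshot:
    """Hypothetical snapshot of two monotonically increasing counters."""
    core_cycles: int
    mem_stall_cycles: int


def memory_stall_fraction(before, after):
    """Fraction of active cycles spent stalled on memory between snapshots."""
    cycles = after.core_cycles - before.core_cycles
    stalls = after.mem_stall_cycles - before.mem_stall_cycles
    if cycles <= 0:
        raise ValueError("no active cycles elapsed between the snapshots")
    return stalls / cycles


if __name__ == "__main__":
    before = CounterSnapshot(core_cycles=1_000, mem_stall_cycles=100)
    after = CounterSnapshot(core_cycles=3_000, mem_stall_cycles=1_100)
    print(f"{memory_stall_fraction(before, after):.0%} of cycles stalled")
```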

The performance counters for that have been in the chips for a long time. You can argue that perf(1) has an unfriendly UX, of course.

I think AMD has a tool to check something somewhat related (cache misses): AMD uProf.

Right, and so does Intel, at least in their high-end chips. But a count of last-level cache misses is just one factor in the cost formula for memory access.

I appreciate it's a complicated and subjective measurement: hyperthreading, superscalar issue, and out-of-order execution all mean that a core can be operating at some fraction of its peak (and what does that mean, exactly?) due to memory stalls, rather than being completely idle. And reads meet the instruction pipeline in a totally different way than writes do.

But a synthesized approximation that could serve as the memory-stall equivalent of perf's -e cycles would be a huge boon to performance analysis and optimization.
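
A crude version of that approximation can be cobbled together today from perf's CSV output. The sketch below assumes output captured from something like `perf stat -x, -e cycles,stalled-cycles-backend -- ./prog` (perf writes it to stderr). The choice of `stalled-cycles-backend` as the stall event is my assumption: it is not available on every CPU, and backend stalls are only a proxy for memory stalls. The function names are hypothetical, and event names that come back with a modifier suffix (e.g. `cycles:u`) would need to be passed in explicitly.

```python
import csv
import io


def parse_perf_stat_csv(text):
    """Map event name -> count from `perf stat -x,` output.

    Assumes the first three fields are: value, unit, event name.
    """
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3 or row[0].startswith("#"):
            continue  # blank lines, comments, anything unexpected
        try:
            counts[row[2]] = float(row[0])
        except ValueError:
            continue  # e.g. "<not supported>" or "<not counted>"
    return counts


def stall_fraction(counts, stall_event="stalled-cycles-backend",
                   cycles_event="cycles"):
    """Stalled cycles as a fraction of all cycles; None if either is missing."""
    cycles = counts.get(cycles_event)
    stalls = counts.get(stall_event)
    if not cycles or stalls is None:
        return None
    return stalls / cycles


if __name__ == "__main__":
    # Made-up numbers, shaped like perf's CSV output.
    sample = (
        "1000,,cycles,1000000,100.00,,\n"
        "250,,stalled-cycles-backend,1000000,100.00,25.00,backend cycles idle\n"
    )
    print(stall_fraction(parse_perf_stat_csv(sample)))
```

This is nowhere near the single trustworthy number asked for above: it ignores SMT, partial utilization, and the read/write asymmetry, which is rather the point.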