by haberman 1605 days ago

The visualization tools presented look really nice, but they seem to present program execution as sequential and linear, a model that will really break down at these time scales (tens of cycles).

Modern processors will look hundreds of instructions into the future and try to start executing them as soon as possible. Branches are predicted far in advance of when they can actually be evaluated. Many instructions can be executing simultaneously. A clean tidy flame graph showing 1-3ns slices (~5 cycles) cannot help but be a vast simplification of what the CPU is really doing.
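To put the "~5 cycles" figure in context, here is a rough sketch of the nanoseconds-to-cycles arithmetic. The clock frequencies are assumptions chosen for illustration (the thread does not name a specific CPU), and `ns_to_cycles` is a hypothetical helper, not part of any tool discussed here.

```python
def ns_to_cycles(duration_ns: float, clock_ghz: float) -> float:
    """Convert a duration in nanoseconds to clock cycles.

    1 GHz is one cycle per nanosecond, so cycles = ns * GHz.
    """
    return duration_ns * clock_ghz


# Assumed clock speeds, for illustration only.
for clock_ghz in (2.0, 3.0, 4.0):
    low = ns_to_cycles(1.0, clock_ghz)
    high = ns_to_cycles(3.0, clock_ghz)
    print(f"{clock_ghz:.1f} GHz: a 1-3 ns slice spans about {low:.0f}-{high:.0f} cycles")
```

At any of these assumed frequencies a 1-3 ns slice is on the order of a few to a dozen cycles, far fewer than the hundreds of instructions the core may have in flight.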

The linked page about Processor Trace says this:

> instruction data (control flow) is perfectly accurate but timing information is less accurate

The article mentions using magic-trace to detect changes in inlining decisions made by the compiler. This is a case where it will shine, since PT can perfectly capture the control flow, and it doesn't necessarily rely on having perfect timestamps for everything.

3 comments

samhw 1603 days ago

Hey - I wanted to say that I came across a comment of yours from more than a decade ago (https://news.ycombinator.com/item?id=2328627), and I was startled at how accurate it is as a prediction of how parsers and IDEs are combined today, about 11 years later. I'm glad you're still commenting on here (and what a criminal understatement it is for that page to characterise their tool's flaws as "timing information is less accurate" - that's bloody execution order that you're talking about!).

Anyway, I wanted to say how much I appreciate your comment of 10 years ago. I'm also a parser nerd, and a performance nerd, and I feel strongly that programmers have a professional responsibility to write code in a way that expresses our intent by a logical minimum of instructions/work. I strongly suspect that this will become important again in the future, not because the ratio of software-efficiency to hardware-power decreases again, but because climate concerns will drive us to measure our code in performance-per-watt rather than performance-per-dollar (depending on what action is taken on carbon pricing, it may be a distinction without a difference).

I look forward to the day when grossly inefficient software is rightly considered to be as unacceptable as grossly inefficient SUVs, and people in our profession are forced to take responsibility for the damage that their obscenely inefficient crap is doing. I hope Python 4 comes with a snorkel.

rrss 1605 days ago

It seems like this is basically unavoidable on existing hardware, though, right?

If we imagine there existed some visualization that could more accurately represent the complexity of a core, I don’t know how it would be possible to get the data, because AFAIK there are no methods to trace execution on modern processors at higher fidelity than this.

Even sampling profilers have similar issues with being limited to the model of sequential instruction streams, since each sample gives a single program counter, not the full view of everything the core has in flight.

haberman 1605 days ago

Yes, I agree that higher-resolution data is not readily available. LLVM MCA has a timeline view that attempts to visualize the overlapping execution of instructions (https://llvm.org/docs/CommandGuide/llvm-mca.html#timeline-vi...), but this is based on models of how the CPU works (not runtime-collected data), and these models are not perfect.

I also agree that sampling profilers have the same issue: instruction-level views of sampling profiles should be taken with a grain of salt.

My concern is that flame graphs with 1-3ns of resolution are presented as a selling point of the tool, without any mention of the caveats around how this model really breaks down at this time scale. I would like to know more details of how the PT data actually relates to the out-of-order execution. Does a branch's timestamp correspond to when that branch was retired? Do we actually know what the timestamp corresponds to, or is it not well-specified? Are there cases where the timestamp is known to be misleading about the true bottleneck?

I don't know the answers to these questions, but when I see a tool like this, I really want more information about the strengths and limitations of the data.

mlyle 1605 days ago

> A clean tidy flame graph showing 1-3ns slices (~5 cycles) cannot help but be a vast simplification of what the CPU is really doing.

Sure, the thinnest slices at the highest zoom are going to be misleading. They're also not what you will generally want to be looking at (though they may provide context that helps you identify which higher-level functions are taking a long time, or hints about cache contention etc).
