Hacker News new | ask | show | jobs
by vlovich123 2115 days ago
> Contrary to what you might expect, instruction counts have proven much better than wall times when it comes to detecting performance changes on CI, because instruction counts are much less variable than wall times (e.g. ±0.1% vs ±3%; the former is highly useful, the latter is barely useful). Using instruction counts to compare the performance of two entirely different programs (e.g. GCC vs clang) would be foolish, but it’s reasonable to use them to compare the performance of two almost-identical programs (e.g. rustc before PR #12345 and rustc after PR #12345). It’s rare for instruction count changes to not match wall time changes in that situation. If the parallel version of the rustc front-end ever becomes the default, it will be interesting to see if instruction counts continue to be effective in this manner.

This is a supremely surprising conclusion, especially in 2020. Is instruction count really still tied to wall clock count? I would have thought that some instructions could be slower than others (especially on x86) so that using more faster individual instructions could be faster than 1 slower instruction. Similarly, cache effects & data dependencies can result in more instructions being faster than fewer instructions.

I think what the author is trying to say is that when evaluating micro-optimizations, cycle counts are pretty valuable still because you're making a small intentional change & evaluating its impact & usually the correlation holds. The dashboard clearly still measures wall-clock since just comparing instruction count over time would be misleading.

I'm curious if the Rust team has evaluated stabilizer to be more robust about the optimizations they choose: https://emeryberger.com/research/stabilizer/

4 comments

> This is a supremely surprising conclusion

That's why I started the paragraph with "Contrary to what you might expect".

As for Stabilizer: "Stabilizer eliminates measurement bias by comprehensively and repeatedly randomizing the placement of functions, stack frames, and heap objects in memory." Those placements can affect cycle counts and wall times a lot, but don't affect instruction counts.

So have you not found in practice any data dependencies or cache issues show up as bottle necks? Or do current tools just make this more of a blind spot for optimization?

Also is there any work to multi-thread the Rust compiler on a more fine-grained level like the recent GCC work? I know you allude to that potentially that would make the instruction counts potentially less reliable so wondering if that's something being explored.

Finally, while I have you, I'm wondering if there's been any exploration of the idea of keeping track of information across builds so that incremental compilation is faster (i.e. only bother recompiling/relinking the parts of the code impacted by a code change). I've always thought that should almost completely eliminate compilation/linking times (at least for debug builds where full utmost optimization is less important).

I mentioned in the post several areas I myself haven't looked at, including cache misses. There may be room for improvements there.

There is an experimental parallel rustc front-end, e.g. see https://internals.rust-lang.org/t/help-test-parallel-rustc/1...

> any exploration of the idea of keeping track of information across builds so that incremental compilation is faster

That's exactly what incremental compilation does.

There's an effort to track which functions are modules & what the downstream implications of that are in terms of needing recompilation? Are there any links to technical descriptions? Super interested in reading up on the technical details involved.
It isn't surprising to me that instruction counts as a performance metric would have less variance than wall time. Did you know that the environment size can actually affect wall time?

> We see that something external and orthogonal to the program, i.e., changing the size (in bytes) of an unused environment variable, can dramatically (frequently by about 33% and once by almost 300%) change the performance of our program. This phenomenon occurs because the UNIX environment is loaded into memory before the call stack. Thus, changing the UNIX environment size changes the location of the call stack which in turn affects the alignment of local variables in various hardware structures.

From https://www.inf.usi.ch/faculty/hauswirth/publications/asplos....

That's exactly my point! The only thing that matters is wall clock time so if cache layout is making a 300% difference to wall clock, focusing on instructions counts will mislead you on the most impactful optimization to make (i.e. cache layout optimizations won't show up in instruction counts).

And yes. I'm aware of that result because of Professor Berger's talks on Coz & the other work he's done in this space.

> The only thing that matters is wall clock time so if cache layout is making a 300% difference to wall clock, focusing on instructions counts will mislead you on the most impactful optimization to make

Ahh, very well said!

> I would have thought that some instructions could be slower than others (especially on x86) so that using more faster individual instructions could be faster than 1 slower instruction..

There are some fun cases where that is definitely true, to whit pdep / pexp on Zen based architectures. https://dolphin-emu.org/blog/2020/02/07/dolphin-progress-rep...

https://twitter.com/uops_info/status/1202950247900684290

> I just ran some tests: the performance seems to depend heavily on the value in the last operand; this is also the case for the register variants. If the last operand is set to -1 (i.e., all bits are 1), the instr. has 518 uops and needs more than 289 cycles!

From what I can tell, Zen based architectures have a slow compatibility-only emulation of those two instructions because the fast implementation is patented - and it's patented by a university who've got some kind of deal with Intel involving co-development of the instruction, rather than by Intel themselves, so AMD's patent cross-license doesn't cover it.
I'd guess there are instructions (eg, x87) that are simply forbidden from use as being universally slower. Once you account for those and are comparing only desirable instructions, there should be a pretty reasonable correlation barring the occasional edge case like mixing AVX512 lightly with other instructions (due to downclocking).