by dmitrygr 815 days ago
I did not talk about power - I talked about perf. No modern x86 chip can decode 6 or 7 of these long instrs per cycle. There are aarch64 chips that can.

Perhaps it's compensated by the fact that a single x86 instruction does more? If a bunch of those aarch64 instructions are loads and stores, but on x86 they're part of the arithmetic instructions, then maybe it doesn't matter?
What impact does it have on the overall performance though? Keller's argument is that the effect is small/negligible.
Keller's argument (as stated) is that it doesn't take up much die space. Hiriki's argument is that it doesn't consume much power. Neither addresses dmitrygr's argument, which is about performance and bottlenecks. (It could use very little power and very little space and still be a very big bottleneck.)

That doesn't mean that dmitrygr is correct. It means that everyone trying to answer him is arguing about the wrong thing.

The main issue with that argument is that the L1i cache can never realistically be exhausted fast enough to form a bottleneck, as long as the decoder is working ahead of the start of the execution pipeline.

The hard limit on instruction size is 15 bytes, so a 64-byte cache line will always be able to store at least 4 of them. (Or 3 plus the tail of an instruction from a previous line.) Meanwhile, on the other end, Intel cores can only retire up to 4 μops per cycle. Since each instruction takes at least 1 μop (except for macro-fusion, which only works on short instructions), retirement will always form a bottleneck before decoding can.
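A quick way to sanity-check the arithmetic in the paragraph above. This is only a back-of-the-envelope sketch: the 64-byte line, the 15-byte limit, and the 4 μops/cycle retirement width are taken from the comment itself, and the function names are made up for illustration. It says nothing about how a real front end behaves.

```python
# Back-of-the-envelope check of the cache-line arithmetic.
# All constants come from the comment above; the retirement width in
# particular is the commenter's figure and varies between cores.
CACHE_LINE_BYTES = 64    # L1i cache line size
MAX_INSTR_BYTES = 15     # hard limit on x86 instruction length
RETIRE_WIDTH_UOPS = 4    # claimed retirement width (assumption)


def whole_instructions_per_line(instr_bytes, line_bytes=CACHE_LINE_BYTES):
    """Number of instructions of the given size that fit entirely in one line."""
    return line_bytes // instr_bytes


def lines_needed_per_cycle(instr_bytes, instrs_per_cycle, line_bytes=CACHE_LINE_BYTES):
    """Average cache lines of instruction bytes consumed per cycle at a given rate."""
    return instr_bytes * instrs_per_cycle / line_bytes


if __name__ == "__main__":
    # Worst case: every instruction is the maximum 15 bytes.
    print(whole_instructions_per_line(MAX_INSTR_BYTES))
    # Bytes demanded at the claimed retirement width, as a fraction of one line.
    print(lines_needed_per_cycle(MAX_INSTR_BYTES, RETIRE_WIDTH_UOPS))
```

Under those assumptions, even a stream of maximum-length instructions consumed at the retirement width needs 60 bytes per cycle, which is just under one cache line.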

And in realistic code where you'd actually see these long instructions, i.e., hot SIMD loops, all the decoded instructions would stay warm and toasty in the μop cache (allegedly holding 6 fixed-size μops per cache line) after the first iteration.

> It could use very little power and very little space and still be a very big bottleneck.

I believe in chip design, this doesn't really happen (often). You can optimize a bottleneck by allocating it more space and power.

I interpret Keller's statement indirectly - the fact that modern x86 CPUs dedicate only a small part of their circuitry to decoding logic means that it's not a bottleneck (otherwise there would be more circuitry for it).

The total architectural difference is pretty small in general. Like, say switching a chip from Intel to ARM lets you make it 30% faster. For the last several decades that was insignificant. Not so much these days though.

The decode difficulty may make a 5% difference, but add in the other things people have mentioned and maybe it adds up to 30%. (numbers pulled out of my arse)

Do you have benchmarks showing this? People would switch to ARM if this were true. Note that Linux and Windows run just fine on ARM.
Difficult to benchmark, but ... people are switching to ARM. You've heard of the M1, right?
Or maybe x86 CPUs were, until recently, designed with performance in mind instead of energy efficiency, and Intel and AMD were slower to move in that direction?

Just wait until Lunar Lake is released this year. It should be an energy-efficient x86 CPU.

Intel chips have been in laptops for literally decades. I don't think that can be the reason.

In any case there's not a huge difference between power efficiency and maximum performance, especially today, because maximum performance is generally power/cooling limited.