by dmitrygr 815 days ago
This misses an important bit: parallel decoding of instructions. It is a lot harder with variable-length instrs where the length cannot even be calculated from the first byte - you need to read 10 bytes in the worst case to find an instr's len in x86. In aarch64 you need to read 0 bytes to know the length - it is 4.

This matters in the way it interacts with the i-cache. In aarch64 with 64-byte cache lines, one cache line is 16 instrs, always. In x86 that cache line could contain only 3 whole instrs. So unless your core is able to ingest over one icache line per cycle (Intel cores currently are NOT), you are limited by how many instrs that one line holds.
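To make the arithmetic above concrete, here is a minimal sketch. It assumes the 64-byte line from the comment, the fixed 4-byte aarch64 instruction size, and x86's 15-byte architectural length limit; `whole_instrs_per_line` and `start_offset` are names made up for this illustration. The offset models the tail of an instruction that started in the previous line.

```python
CACHE_LINE_BYTES = 64     # line size assumed in the comment above
AARCH64_INSTR_BYTES = 4   # every aarch64 instruction is 4 bytes
X86_MAX_INSTR_BYTES = 15  # architectural upper limit on x86 instruction length


def whole_instrs_per_line(instr_bytes, start_offset=0):
    """Count instructions of a fixed size that fit entirely in one cache line.

    start_offset is the number of leading bytes in the line already taken up
    by the tail of an instruction that began in the previous line.
    """
    return (CACHE_LINE_BYTES - start_offset) // instr_bytes


aarch64_per_line = whole_instrs_per_line(AARCH64_INSTR_BYTES)

# Best case for maximum-length x86 instructions: the first one starts at byte 0.
x86_aligned = whole_instrs_per_line(X86_MAX_INSTR_BYTES)

# Worst case: try every possible tail length (0..14 bytes) spilling into the line.
x86_worst = min(
    whole_instrs_per_line(X86_MAX_INSTR_BYTES, off)
    for off in range(X86_MAX_INSTR_BYTES)
)

print(aarch64_per_line, x86_aligned, x86_worst)
```

This only bounds the pathological case where every instruction is as long as x86 allows; typical x86 code averages far fewer bytes per instruction.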


https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...

>Another oft-repeated truism is that x86 has a significant ‘decode tax’ handicap. ARM uses fixed length instructions, while x86’s instructions vary in length. Because you have to determine the length of one instruction before knowing where the next begins, decoding x86 instructions in parallel is more difficult. This is a disadvantage for x86, yet it doesn’t really matter for high performance CPUs because in Jim Keller’s words:

>For a while we thought variable-length instructions were really hard to decode. But we keep figuring out how to do that. … So fixed-length instructions seem really nice when you’re building little baby computers, but if you’re building a really big computer, to predict or to figure out where all the instructions are, it isn’t dominating the die. So it doesn’t matter that much.

>...

>Researchers agree too. In 2016, a study supported by the Helsinki Institute of Physics[2] looked at Intel’s Haswell microarchitecture. There, Hirki et al. estimated that Haswell’s decoder consumed 3-10% of package power. The study concluded that “the x86-64 instruction set is not a major hindrance in producing an energy-efficient processor architecture.”

I did not talk about power - I talked about perf. No modern x86 chip can decode 6 or 7 of these long instrs per cycle. There are aarch64 chips that can.

Perhaps it's compensated by the fact that a single x86 instruction does more? If a bunch of those aarch64 instructions are loads and stores, but on x86 they're part of the arithmetic instructions, then maybe it doesn't matter?

What impact does it have on the overall performance though? Keller's argument is that the effect is small/negligible.

Keller's argument (as stated) is that it doesn't take up much die space. Hirki's argument is that it doesn't consume much power. Neither addresses dmitrygr's argument, which is about performance and bottlenecks. (It could use very little power and very little space and still be a very big bottleneck.)

That doesn't mean that dmitrygr is correct. It means that everyone trying to answer him is arguing about the wrong thing.

The main issue with that argument is that the L1i cache can never realistically be exhausted fast enough to form a bottleneck, as long as the decoder is working ahead of the start of the execution pipeline.

The hard limit on instruction size is 15 bytes, so a 64-byte cache line will always be able to store at least 4 of them. (Or 3 plus the tail of an instruction from a previous line.) Meanwhile, on the other end, Intel cores can only retire up to 4 μops per cycle. Since each instruction takes at least 1 μop (except for macro-fusion, which only works on short instructions), retirement will always form a bottleneck before decoding can.

And in realistic code where you'd actually see these long instructions, i.e., hot SIMD loops, all the decoded instructions would stay warm and toasty in the μop cache (allegedly holding 6 fixed-size μops per cache line) after the first iteration.
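A toy model of that amortization, as a rough sketch only: it assumes a loop that either fits entirely in the μop cache (so the legacy decoder runs only on the first iteration) or does not fit at all (so every iteration is decoded again). Real μop caches are set-associative and evict at a finer granularity; the function name and the capacity figure in the usage line are made up for illustration.

```python
def legacy_decode_fraction(loop_instrs, iterations, uop_cache_capacity):
    """Fraction of executed instructions that went through the legacy decoder.

    Toy assumptions (not a model of any real core):
    - capacity is counted in instructions, one uop per instruction;
    - a loop that fits is decoded once and then served from the uop cache;
    - a loop that does not fit is decoded again on every iteration.
    """
    total_executed = loop_instrs * iterations
    if loop_instrs <= uop_cache_capacity:
        decoded = loop_instrs          # first iteration only
    else:
        decoded = total_executed       # cache thrashes, nothing is reused
    return decoded / total_executed


# A 32-instruction hot loop run 1000 times, with a hypothetical capacity.
hot_loop = legacy_decode_fraction(32, 1000, uop_cache_capacity=1536)
print(hot_loop)
```

Under these assumptions the decoder only matters for code that misses the μop cache, which is why the hot-loop case is the friendliest one for x86.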

> It could use very little power and very little space and still be a very big bottleneck.

I believe in chip design, this doesn't really happen (often). You can optimize the bottlenecks by allocating it more space and power.

I interpret Keller's statement indirectly - the fact that modern x86 CPUs dedicate only a small part of their circuitry to decoding logic suggests that it's not a bottleneck (otherwise there would be more circuitry for it).

The total architectural difference is pretty small in general. Like, say switching a chip from Intel to ARM lets you make it 30% faster. For the last several decades that was insignificant. Not so much these days though.

The decode difficulty may make a 5% difference, but add in the other things people have mentioned and maybe it adds up to 30%. (numbers pulled out of my arse)

Do you have benchmarks showing this? People would switch to ARM if this were true. Note that Linux and Windows run just fine on ARM.

Difficult to benchmark, but ... people are switching to ARM. You've heard of the M1, right?

I believe that modern x86 processors store decoded micro-ops in the I$ (instruction cache).

I always understood micro-ops to be fixed length.

Sure, you have to decode the variable length instructions at some point. But that extra work, relative to aarch64, is in practice amortized over the lifetime of that cache line.

That's not how it works for either Intel or AMD's current designs. Both use an L1 I$ which consists of encoded instructions, while adding a small uop cache (sometimes called L0) for recently decoded instructions.[1][2]

Intel's Netburst architecture stored decoded instruction sequences in its L1 cache, which Intel called a trace cache.[3] This didn't work out too well, so Intel reverted to a conventional L1 cache with the successor Merom[4] and introduced the uop cache shortly after with Sandy Bridge[5].

[1]https://chipsandcheese.com/2021/12/02/popping-the-hood-on-go...

[2]https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-fron...

[3]https://chipsandcheese.com/2022/06/17/intels-netburst-failur...

[4]https://chipsandcheese.com/2023/02/05/intels-dunnington-core...

[5]https://chipsandcheese.com/2023/08/04/sandy-bridge-setting-i...

> So unless your core is able to ingest over one icache line per cycle (intel cores currently are NOT), you are thus limited.

Do Intel cores no longer have a μop cache in front of the L1i cache?

>It is a lot harder with variable-length instrs where the length cannot even be calculated from the first byte - you need to read 10 bytes in the worst case to find an instr's len in x86. In aarch64 you need to read 0 bytes to know the length - it is 4

x86's approach to variable-length instructions is unfortunate.

In contrast, RISC-V leverages variable-length encoding to get the best code density among 64bit ISAs while sidestepping the instruction boundary problem.

(I digress, but note that while for the 32bit ISA RISC-V code density was competitive yet bested by ARM thumb2, it has since improved; RISC-V has the best density overall)

The length of a RISC-V instruction is in the first byte though, not the tenth.
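As a minimal sketch of that point: in the standard RISC-V encoding, the two lowest bits of the first (lowest-addressed) byte distinguish 16-bit compressed instructions from longer ones, and bits [4:2] then distinguish the 32-bit case from the reserved longer formats. The function names below are made up for illustration, and encodings longer than 32 bits are deliberately not handled.

```python
def riscv_instr_length(first_byte):
    """Return the length in bytes of a RISC-V instruction from its first byte.

    Only the standard 16-bit (C extension) and 32-bit encodings are handled.
    """
    if first_byte & 0b11 != 0b11:
        return 2  # bits [1:0] != 11 -> 16-bit compressed instruction
    if first_byte & 0b11100 != 0b11100:
        return 4  # bits [1:0] == 11 and bits [4:2] != 111 -> 32-bit
    raise ValueError("longer-than-32-bit encoding, not handled in this sketch")


def instr_starts(code):
    """Walk a byte string and return the offset where each instruction starts."""
    offsets = []
    pos = 0
    while pos < len(code):
        offsets.append(pos)
        pos += riscv_instr_length(code[pos])
    return offsets


# c.nop (0x0001), nop = addi x0, x0, 0 (0x00000013), c.nop; little-endian bytes.
code = bytes([0x01, 0x00, 0x13, 0x00, 0x00, 0x00, 0x01, 0x00])
starts = instr_starts(code)
print(starts)
```

The walk is still sequential here, but because each length depends on only one byte, a hardware decoder can compute the candidate length at every 2-byte offset in parallel and then select the valid starts.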

Note that RISC-V's code density with the C extension is in bytes, not in number of instructions. The core integer ISA was designed to be extensible from small embedded MCUs, so every other chip has to use it. High-performance RISC-V cores depend a lot on macro-op fusion to run as fast as 64-bit ARM.

>not in number of instructions.

This comes up very often, but is an unfounded concern. Not only is instruction count competitively low, but as it turns out, critical paths of inter-dependent instructions are, at worst (w/o fusion or 2019+ extensions), no worse than aarch64[0].

>The core integer ISA was designed to be extensible from small embedded MCUs, so every other chip has to use it.

There's so much to unpack here. Firstly, the ISA, as documented in the specification itself[1], is described as "An ISA separated into a small base integer ISA, usable by itself as a base for customized accelerators or for educational purposes, and optional standard extensions, to support general-purpose software development." Note there's no reference to small embedded MCUs in there.

Furthermore, the spec elaborates "An ISA that avoids “over-architecting” for a particular microarchitecture style (e.g., microcoded, in-order, decoupled, out-of-order) or implementation technology (e.g., full-custom ASIC, FPGA), but which allows efficient implementation in any of these."

>High-performance RISC-V cores depend a lot on macro-op fusion to run as fast as 64-bit ARM.

First news. There seems to be some confusion here. 64-bit ARM (aarch64) is implemented in a range of microarchitectures, targeting different uses. I will go ahead and assume (for convenience) that you meant specifically very high performance implementations, as used in workstations and servers.

These tend to be superscalar and very wide (Apple M1 and Tenstorrent Ascalon are 8-wide). Their execution units tend to be simpler, and instead there are more of them and some can only do specific tasks. Typically, for these macro-op fuse-able instructions, an ARM microarchitecture will have to emit multiple micro-ops, whereas in RISC-V they already come as separate instructions.

0. https://dl.acm.org/doi/pdf/10.1145/3624062.3624233

1. https://riscv.org/technical/specifications/ (unprivileged architecture)

I think you are missing the only point of the article: performance and compatibility are important; everything else is just aesthetics.

As long as Intel can produce fast CPUs, with new features and while maintaining support for the existing binaries, everything is OK. Fixed or variable length, that's a matter for Intel engineers: users could, and should, care less.

Most important applications have an ARM version now. Especially true since Apple Silicon and AWS Graviton. Windows will force developers to compile both x86 and ARM versions.
It's a nice theory but I don't think it holds up. X64 executes from a micro op cache and there's no particular reason to expect the ops in that to be variable length encoded. Thus it only goes to the i-cache when that misses, at which point you've spent long enough digging around in the cache that the extra decoding probably doesn't matter.

It's kind of like saying x64 is limited by only having 16 registers - there's only names for 16ish in the ISA, but there's loads more registers in the machine as part of hiding latency.

>It's kind of like saying x64 is limited by only having 16 registers - there's only names for 16ish in the ISA, but there's loads more registers in the machine as part of hiding latency.

Why not have just one, then?

After all, there's loads more registers in the machine as part of hiding latency.

The ISA either matters or it does not. Pick one.

Usually the really fat instructions take over 1 cycle anyway, right? So the decoder should be able to keep up.

pipelining...

they are usually pipelineable

Those pipelines come with an area cost, a power cost and a latency cost.