by SuperscalarMeme 1944 days ago
Performance is agnostic of ISA. Apple's custom designed cores do indeed have a massive performance/Watt advantage over x86 based designs and happen to be using ARM. However, it's not impossible for an x86 CPU to be designed in a similar way. It does, however, get more difficult to do so due to x86's variable length instruction encoding, which ARM does not have.

x86’s instruction decoder suffers from its inability to parallelize some things. Because instructions have no fixed boundary,[a] something has to process the bytes sequentially. Even if they can be read from memory in massive amounts, something still has to sit there going byte by byte to find the boundaries.

The good news is, once those boundaries are found, uops can be generated. But that ~5% or so of die space is always running full tilt (provided there’s no pipeline stalls).

I’m sure Intel and AMD have put a massive amount of work into theirs to make it as quick as possible,[b] but it’s still ultimately a sequential operation.

With RISC-like architectures like ARM and RISC-V, you don’t need that boundary detector. Just feed the 2 or 4 bytes straight into the decoders.

[a]: Unlike ARM and RISC-V which have fixed 2 or 4 byte encodings (depending on processor mode), x86’s instructions can be anywhere from 1 through 15 bytes.

[b]: Take the EVEX prefix for example. It is always 4 bytes long with the first one being 0x62. So, once you see that 0x62 byte after the optional “legacy prefixes”, you can skip 3 bytes and go to the opcode. But then you need to decode that opcode to see if it has a ModR/M byte, decode that (partially) to see if there’s an SIB byte, decode that to see if there’s a displacement (of 1, 2, or 4 bytes), etc. And then, don’t forget about the immediate (which can be 1, 2, 4, or (in one case of MOV) 8 bytes).
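To make the sequential dependency concrete, here is a minimal sketch. It assumes a made-up toy encoding (not real x86) in which the low 4 bits of an instruction's first byte give its total length; the function names and the sample byte stream are purely illustrative. The point is only that each start offset depends on the previous instruction's length, while fixed-width offsets can be computed without reading a single byte.

```python
# Toy encoding (NOT real x86): the low 4 bits of the first byte of each
# instruction give that instruction's total length in bytes (1..15).

def variable_boundaries(code: bytes) -> list[int]:
    """Find instruction start offsets by walking the stream in order.

    Each start offset is only known once the previous instruction's
    length has been decoded, so the loop is inherently sequential.
    """
    offsets = []
    pos = 0
    while pos < len(code):
        length = code[pos] & 0x0F
        if length == 0:
            raise ValueError(f"invalid length at offset {pos}")
        offsets.append(pos)
        pos += length
    return offsets


def fixed_boundaries(code: bytes, width: int = 4) -> list[int]:
    """With a fixed width, every start offset is known up front."""
    return list(range(0, len(code), width))


# Hypothetical stream: instructions of length 3, 1, 5 and 2.
stream = bytes([0x03, 0xAA, 0xBB, 0x01, 0x05, 0x01, 0x02, 0x03, 0x04, 0x02, 0xFF])
print(variable_boundaries(stream))
print(fixed_boundaries(bytes(16)))
```

Real x86 length decoding is far messier than this (prefixes, ModR/M, SIB, displacement, immediate), but the data dependency between consecutive instructions is the same.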

Something has been bugging me about x86’s lack of boundaries...could the boundaries be computed ahead-of-time and passed to the processor?
Not that I’m aware of. The decoding of an instruction is complicated and also dependent on the current operating mode and a few other things. So, for an OS to pass those lengths beforehand, it’d have to know everything about the current state of the processor at that instruction. For example, in 16 and 32 bit modes, opcodes 0x40 through 0x4F are single byte INC and DEC (one for each register). In 64 bit mode, those are the single byte REX prefixes; the actual opcode follows. See also: the halting problem.
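The 0x40 through 0x4F example can be sketched as a small lookup. This is only an illustration of how the same byte means different things depending on the mode; the function name `decode_0x4x` and the `mode_bits` parameter are made up for this sketch, and operand-size overrides are ignored.

```python
# Register order used by the one-byte INC/DEC encodings.
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_0x4x(byte: int, mode_bits: int) -> str:
    """Interpret a byte in 0x40..0x4F for a given operating mode (16, 32 or 64)."""
    if not 0x40 <= byte <= 0x4F:
        raise ValueError("byte is outside 0x40..0x4F")
    if mode_bits == 64:
        # REX prefix layout: 0100WRXB. The real opcode follows this byte.
        w = (byte >> 3) & 1
        r = (byte >> 2) & 1
        x = (byte >> 1) & 1
        b = byte & 1
        return f"REX prefix W={w} R={r} X={x} B={b}"
    if mode_bits not in (16, 32):
        raise ValueError("mode_bits must be 16, 32 or 64")
    op = "INC" if byte < 0x48 else "DEC"
    reg = REGS32[byte & 7]
    if mode_bits == 16:
        reg = reg[1:]  # default operand size is 16 bits: eax -> ax
    return f"{op} {reg}"


print(decode_0x4x(0x48, 32))
print(decode_0x4x(0x48, 64))
```

In 32 bit mode, 0x48 is a complete one-byte instruction; in 64 bit mode it is only a prefix, so the instruction length changes too.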

As for why it became an issue, instruction sets need to be designed from the beginning to be forward expandable. Intel has historically not done that with x86. Take the vector extensions for example. Originally, with SSE, it was just 128 bit (XMM) vectors encoded as an opcode with various prefix bytes being used in ways they weren’t intended. Later, with AVX, 256 bit vectors were needed. So they made the VEX prefix. But it only had 1 bit for vector length. This allowed 128 bit (XMM) and 256 bit (YMM) vectors, but nothing else. So when AVX-512 came along, Intel had to ditch it and create the EVEX prefix and allow both to be used. But EVEX only has 2 bits for vector length. So, should something past AVX-512 come out (AVX-768 or AVX-1024?), it’ll probably use the reserved bit pattern 11, and they’ll be stuck again if they want to go past that.
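The vector length headroom described above can be written down as two small tables. This is a simplified sketch: the names are made up, and it ignores the cases where the EVEX length bits are reused for other purposes (such as rounding control).

```python
# VEX has a single length bit (VEX.L).
VEX_VECTOR_BITS = {0b0: 128, 0b1: 256}

# EVEX has two length bits; pattern 0b11 is reserved.
EVEX_VECTOR_BITS = {0b00: 128, 0b01: 256, 0b10: 512}


def evex_vector_bits(length_field: int) -> int:
    """Map the 2-bit EVEX vector length field to a width in bits."""
    if length_field not in range(4):
        raise ValueError("length field is only 2 bits wide")
    if length_field not in EVEX_VECTOR_BITS:
        raise ValueError("0b11 is reserved; no vector length is assigned")
    return EVEX_VECTOR_BITS[length_field]


# Number of unused encodings left in each prefix's length field.
vex_spare = 2 ** 1 - len(VEX_VECTOR_BITS)
evex_spare = 2 ** 2 - len(EVEX_VECTOR_BITS)
print(vex_spare, evex_spare)
```

VEX has no spare encoding at all, and EVEX has exactly one, which is the "stuck again" situation described above.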

For an example of this being done right, ForwardCom[0] (started by the great Agner Fog) took the “forward compatibility” (hence the name) issue into mind and used 2 bits to signal the instruction length. It’ll probably never reach silicon, but it and RISC-V (which is in silicon form) are good examples of attempting to keep things forward compatible.

[0]: https://forwardcom.info/

> Not that I’m aware of. The decoding of an instruction is complicated and also dependent on the current operating mode and a few other things. So, for an OS to pass those lengths beforehand, it’d have to know everything about the current state of the processor at that instruction

The compiler would know the instruction boundaries. It could store that information in a read-only section in the executable. The OS would then just pass that section to the CPU somehow.

I don't think there is anything impossible about this. Would there be sufficient performance benefit to justify the added complexity? I don't know, quite possibly not.
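As a purely hypothetical sketch of the idea (no such mechanism exists in x86 as far as this thread establishes), a side table of boundary hints could be checked per instruction. The toy encoding below is made up (low 4 bits of the first byte give the instruction length), as are the names `hints_are_valid` and `stream`. Each hinted offset is checked independently of the others, which is what would let hardware do the checks in parallel.

```python
def hints_are_valid(code: bytes, hints: list[int]) -> bool:
    """Check compiler-provided start offsets against the toy encoding.

    Every iteration only looks at its own hint and the next one, so the
    checks do not depend on each other's results.
    """
    if not hints:
        return len(code) == 0
    if hints[0] != 0:
        return False
    for i, off in enumerate(hints):
        if not 0 <= off < len(code):
            return False
        length = code[off] & 0x0F  # toy rule: length lives in the low 4 bits
        expected_next = hints[i + 1] if i + 1 < len(hints) else len(code)
        if length == 0 or off + length != expected_next:
            return False
    return True


# Hypothetical stream: instructions of length 3, 1, 5 and 2.
stream = bytes([0x03, 0xAA, 0xBB, 0x01, 0x05, 0x01, 0x02, 0x03, 0x04, 0x02, 0xFF])
print(hints_are_valid(stream, [0, 3, 4, 9]))
print(hints_are_valid(stream, [0, 3, 5, 9]))
```

A failed check here would correspond to the decoder rejecting the hint table rather than trusting it.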

This sounds like a potential attack vector.
I'm not sure why it would be. If the boundary information were wrong, the CPU instruction decode would fail, but that should just be an invalid instruction exception, which operating systems already know how to handle.
"Performance is agnostic of ISA" is too strong a statement. The variable length instruction encoding is a significant performance disadvantage, as is the strict memory ordering requirement of X86/X64.

X64 decoders are indeed only ~5% of the die on a modern CPU, but it's 5% that is always at 100% utilization. That's a non-trivial amount of extra power. X64 decode parallelism is also limited. I've heard four instructions at once as a magic number beyond which it becomes really hard. This is why hyperthreading (SMT) is so common on X64 chips. It's a "cheat" to keep the pipeline full by decoding two different streams in parallel (allowing 8X parallelism). SMT isn't free though. It drags in a lot of complexity at the register file, pipeline, and scheduler levels, and is a bit of a security minefield due to spectre-style attacks. All that complexity adds more overhead and therefore more power consumption as well as taking up die space that could be used for more cores, wider cores, more cache, etc.

ARM is just a lot easier to optimize and crank up performance than X86. The M1 apparently has 8X wide instruction decode, and with fixed length instructions it would be trivial to take it to 16X or 32X if there was benefit to that. I could definitely imagine something like a 16X wide ARM64 core at 3nm capable of achieving up to 16X instruction level parallelism as well as supporting really wide vector operations at really high throughput. Put like 16 of those on a die and we're really far beyond X64 performance in every category.

This is also why SMT/hyperthreading doesn't really exist in the ARM world. There's less to be gained from it. Better to have a simpler core and more of them.

IMHO X86/X64 has hit a performance wall at least in terms of power/performance, and this time it might be insurmountable due to variable length instructions and associated overhead. It matters in the data center as well as for mobile and laptops. There's a reason AWS is pricing to steer people toward Graviton: it costs less to run. Power is the largest component of most data center costs.

While it’s absolutely true that fixed width instructions make parallel decoding vastly easier, there’s a cost in terms of binary footprint size. x86 generally has an advantage in instruction cache and TLB performance for this reason, which can be significant depending on the workload.
Not true. This is a common myth that comes from some old Linus posts in the 32-bit Pentium 4 days and still won't die. I've done comparisons to test this. Compare sizes of modern x86-64 Linux binaries to their counterparts on AArch64. You'll find that they're extremely close.

The biggest problem is all the REX prefixes. The inefficient encoding of registers in x86-64 squandered all the advantages that x86 had.

Is true. They said:

> > x86 generally has an advantage [emphasis added, not "x86-64"]

Obviously if you take the worst of both worlds (bloated and variable-width instructions), you can squander that advantage, but the advantage is in fact real.

Is this still really relevant? I can understand that it could have been a problem 20 years ago, but with current processors having huge L1 caches and memory bandwidth, I am starting to think that 4 bytes (or variable 4/8 bytes) is not a bad tradeoff for density vs. superscalar.
L1 size in 1999: 32 kB

L1 size in 2021: 64 kB

The L1 size is yet another place where the x86 legacy hinders things. To avoid aliasing in a virtually indexed L1 cache (which is what you want for performance in a L1 cache, since a physically indexed cache would have to wait for the TLB lookup), the size of each way is limited to the page size, which on x86 is 4096 bytes. To get a 64 KiB L1 cache, it would have to be a 16-way cache, and increasing that too much makes the cache slower and more power-hungry. It's no wonder Apple decided to use a 16 KiB page size instead of a 4 KiB page size; a 64 KiB VIPT L1 cache with 16 KiB page size needs only 4 ways.
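The arithmetic behind those numbers, as a minimal sketch. It assumes the constraint stated above (each way of a VIPT cache is at most one page in size); the function name `min_ways_vipt` is made up for illustration.

```python
KIB = 1024


def min_ways_vipt(cache_bytes: int, page_bytes: int) -> int:
    """Minimum associativity so that each way is no larger than one page."""
    if cache_bytes <= 0 or page_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: a partially filled way still counts as a way.
    return -(-cache_bytes // page_bytes)


print(min_ways_vipt(64 * KIB, 4 * KIB))   # 4 KiB pages
print(min_ways_vipt(64 * KIB, 16 * KIB))  # 16 KiB pages
```

With 4 KiB pages a 64 KiB cache needs 16 ways, and with 16 KiB pages it needs only 4, matching the figures in the comment.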

For the L1 instruction cache, aliasing shouldn't be a problem (since it's never written to), but this is once again another place where the x86 legacy hinders things: instead of requiring an explicit instruction to invalidate a virtual address in the instruction cache, it's implicitly invalidated when writing to that address.

Apple M1 big core cache sizes:

192KB L1I/128KB L1D

Little cores: 128KB L1I/64KB L1D

Wow. Didn't know that. That should more than compensate for a very slight increase in code size for ARM64 vs X64.

When I use M1, AWS Graviton, or even older Cavium ThunderX chips I can't help but think that X86 is on its way out. The advantage is something you can subjectively see and feel. It's obvious, especially when it comes to power consumption.

Process node has something to do with it, but it's not the whole story. I'm typing on a 10nm Ice Lake MacBook Air and while this chip is better than older 14nm Intel laptops it's still just shockingly crushed by the M1 on every metric. 10nm -> 5nm is not enough to explain that, especially since apparently Intel is more conservative with its numbering and Intel 10nm is more comparable to TSMC 7nm. So it's more like TSMC 7nm vs TSMC 5nm, which is not a large enough gap to account for what seems to be at least 1.5X better performance and 3X better power efficiency.

Some of the X86/X64 apologists remind me of old school aerospace companies dissing not only SpaceX and Blue Origin but the whole idea of reusable rockets, trying to convince us that there's little economic advantage in reusing a $100M rocket stage that consumes ~$100-200K in fuel per launch.

"That's not much of a meteorite. It's no big deal." - Dinosaurs

Que? Look at VLIW ISAs for five minutes and tell me how you've arrived at "agnostic".
Agnostic is a little strong, although it is true that M1 is extremely wide especially for a laptop chip, and wide in ways beyond the decoder which could be applied to an X86 part.

Ultimately these discussions are quite hard because AMD aren't on exactly the same density, and Intel are quite a way behind at the moment.