Hacker News new | ask | show | jobs
by voidlogic 2665 days ago
"On the other hand, that wastes cache memory and fetch bandwidth..."

This. One of the reasons why modern x86 processors have such strong performance is their external facing CISC interface effectively acts as memory compression while they are RISC on the inside. In many ways its a best of both worlds that was achieved through incremental evolution.

5 comments

No, x86-64 is just as inefficient as AArch64 these days, because of all the REX prefixes. Almost every instruction needs one and it bloats the size tremendously, to the extent that most x86-64 instructions are just about 4 bytes. Measure binary sizes if you don't believe me.
Almost every instruction needs one and it bloats the size tremendously, to the extent that most x86-64 instructions are just about 4 bytes.

That's only if your code happens to be particularly "64-bit-heavy", or the compiler isn't doing a good job at selecting registers; the original designers (at AMD, not Intel) decided on the prefices (and defaulting to 32-bit for most ops) instead of defaulting all operations to 64-bit in 64-bit mode precisely because it would be better for size and performance --- using their carefully optimised compilers. Plus, what can be done with a single 4-byte instruction on x86 can require multiple 4-byte ARM instructions, and that adds up quickly.

I can't find it at the moment but one of the studies I remember comparing the binary sizes was using GCC, which is widely available and free, but probably one of the worst compilers at x86 size optimisation I've seen. I even recall a remark in that study about how it was generating mostly RISC-like instructions, so in other words they were comparing binaries generated for a RISC CPU using a RISC-oriented compiler with ones generated for a CISC CPU using a RISC-oriented compiler, failing to exploit the full capabilities of a CISC.

I've written x86 Asm for several decades (started with 16-bit --- dating myself here...), and done some occasional MIPS and ARM, and it's very difficult for me to believe that the RISCs have any intrinsic advantage in code density other than the fact that compilers for x86 aren't that great at it; you can write a Fibonacci calculator for the latter in 5 bytes and pushes and pops are single-byte instructions, while on the former even a register-register move is 4 bytes.

> That's only if your code happens to be particularly "64-bit-heavy", or the compiler isn't doing a good job at selecting registers

No, that's true for basically all code. 6 or 7 registers isn't enough for basically anything interesting, so you end up pretty much always hitting the high registers.

> Plus, what can be done with a single 4-byte instruction on x86 can require multiple 4-byte ARM instructions, and that adds up quickly.

The only real difference is that you have memory load addressing modes in x86, while for load-store architectures like AArch64 you don't. But:

* On x86-64 you have two-address instructions, not three-address instructions. This means that AArch64 "sub x9,x10,x11", or "49 01 0b cb", becomes x86-64 "mov r9,r10; sub r9,r11", or "4d 89 d1 4d 29 d9": 4 bytes vs. 6, thanks to the doubled REX prefix.

* On x86-64 immediates are very inefficiently encoded, while they tend to be compressed on RISCs to fit in the 32-bit instruction word. The end result is that AArch64 "sub x9,x10,#1234", or "49 49 13 d1" in 64-bit mode becomes x86-64 "lea r9,[r10-1234]", which is "4d 8d 8a 2e fb ff ff": 4 bytes vs. 7.

> I can't find it at the moment but one of the studies I remember comparing the binary sizes was using GCC, which is widely available and free, but probably one of the worst compilers at x86 size optimisation I've seen.

LLVM is doing pretty well at x86-64 size optimization: for example, it prefers to select lower registers to reduce size. As I recall, Dan Gohman told me the code size win was something on the order of 2%. It really doesn't make a big difference: AArch64 and x86-64 have about the same code size.

> you can write a Fibonacci calculator for the latter in 5 bytes

But real code, again, hits the high registers.

> pushes and pops are single-byte instructions

Pushes and pops aren't used by most compilers, except in function prologs and epilogs. This is actually an example of inefficiency in the design of x86-64. The opcode space shouldn't go to functions that are only used to set up and tear down functions.

> on the former even a register-register move is 4 bytes

"mov r11,r12" is 3 bytes on x86-64. Not a big difference…

I even recall a remark in that study about how it was generating mostly RISC-like instructions,

There is a very good reason for GCC's x86 backend to do this. Intel/AMD optimisation manuals provide a subset of x86 instructions that are worth using. Instructions that are actually fast in modern designs, that don't fall back to legacy microcode.

This subset looks very RISC-like.

Sure, x86 has CISC instructions that sometimes allow very dense code, but if you want your code to actually run fast you need to do it the RISC way.

I would love to see a newer CISC that didn't have all these required prefixes, and took a more Huffman encoding perspective so that instructions like hlt aren't allocated a single byte. Memory to memory ops essientially let you encode physical registers without using architectural registers, saving additional bits too.
I have encountered an instruction set which supported 16bit, 32bit, 48bit and 80bit encodings.

Was basically Thumb2 but with more flexibility, and it still felt like RISC.

And RISC-V encodes programs into fewer bytes than X86, so it wins in that regard. But there are still no implementation that have all the other features needed for top performance. There are many factors that affect performance.
"There are many factors that affect performance."

Not only man different features contribute to performance but also that put performance depends so much on use-case. Two CPUs implementing the same arch might each win a benchmark that has a different instruction mix or memory access pattern, etc.

Nope, you don't need anything like the complexity of x86 decode in order to reach good instruction density. RISC-V matches x86 with a simple "compressed" extension involving 16-bit forms for some of the most common instructions (but still quite straightforward to decode in hardware).
I agree. Its almost always more performant to intentionally build a system that to have characteristics that another system incrementally evolved to have. This is why rewritten software (if actually delivered) often performs better than the original, its usually this and not new framework X or new language Y that resulted in the win.
Not being any kind of expert in processor microarchitecture, I wonder if there is potential in just actually compressing instructions, as in with a huffman table. Could hardware decoding be made fast enough that the cache and fetch savings would make up for it?
Modern ISAs like RISCV come up with their space conscious portions of their ISAs by taking a sort of human encoding perspective on the instruction stream. The definitely iterated on that base concept of allocate the number of bits needed based on frequency in the stream.
Apple is already doing this (transparent memory/cache compression) on the GPU side, and there is some speculation they are doing it for the LLC on the CPU side, or may start doing it soon, based on patents they have filed.
If you look at the benchmarks for x32 (ILP32) you see an improvement of up to 10% over x86-64. Granted this most benefits pointer heavy code, but is still doesn't make up for the lions share of performance.
That kind of benchmark shows the data path cost of big pointers but misses much of the instruction path cost of inefficient encoding, because the cost doesn't usually manifest as bottlenecking at the frontend. The cost is having a frontend that can keep up. On Intel cores this involves having a huge decoder that can deal with arbitrary alignment and unbounded sequences of prefixes; and having caches both for instructions and decoded uops (multiple types of the latter). Plus fundamental limitations on what the backend can do: there would be no point adding a 4th vector op x-port, because the frontend is miles away from being able to keep up with that many instruction bytes. All of that, and still the programmer/compiler walks a knife's edge avoiding frontend bottlenecks, trying to keep code tight and aligned so the important parts fit in loop buffer or at least uop cache and don't have to squeeze through the decoders. Use it or don't, REX is paid for.