Hacker News new | ask | show | jobs
by userbinator 2664 days ago
Almost every instruction needs one and it bloats the size tremendously, to the extent that most x86-64 instructions are just about 4 bytes.

That's only if your code happens to be particularly "64-bit-heavy", or the compiler isn't doing a good job at selecting registers; the original designers (at AMD, not Intel) decided on the prefices (and defaulting to 32-bit for most ops) instead of defaulting all operations to 64-bit in 64-bit mode precisely because it would be better for size and performance --- using their carefully optimised compilers. Plus, what can be done with a single 4-byte instruction on x86 can require multiple 4-byte ARM instructions, and that adds up quickly.

I can't find it at the moment but one of the studies I remember comparing the binary sizes was using GCC, which is widely available and free, but probably one of the worst compilers at x86 size optimisation I've seen. I even recall a remark in that study about how it was generating mostly RISC-like instructions, so in other words they were comparing binaries generated for a RISC CPU using a RISC-oriented compiler with ones generated for a CISC CPU using a RISC-oriented compiler, failing to exploit the full capabilities of a CISC.

I've written x86 Asm for several decades (started with 16-bit --- dating myself here...), and done some occasional MIPS and ARM, and it's very difficult for me to believe that the RISCs have any intrinsic advantage in code density other than the fact that compilers for x86 aren't that great at it; you can write a Fibonacci calculator for the latter in 5 bytes and pushes and pops are single-byte instructions, while on the former even a register-register move is 4 bytes.

2 comments

> That's only if your code happens to be particularly "64-bit-heavy", or the compiler isn't doing a good job at selecting registers

No, that's true for basically all code. 6 or 7 registers isn't enough for basically anything interesting, so you end up pretty much always hitting the high registers.

> Plus, what can be done with a single 4-byte instruction on x86 can require multiple 4-byte ARM instructions, and that adds up quickly.

The only real difference is that you have memory load addressing modes in x86, while for load-store architectures like AArch64 you don't. But:

* On x86-64 you have two-address instructions, not three-address instructions. This means that AArch64 "sub x9,x10,x11", or "49 01 0b cb", becomes x86-64 "mov r9,r10; sub r9,r11", or "4d 89 d1 4d 29 d9": 4 bytes vs. 6, thanks to the doubled REX prefix.

* On x86-64 immediates are very inefficiently encoded, while they tend to be compressed on RISCs to fit in the 32-bit instruction word. The end result is that AArch64 "sub x9,x10,#1234", or "49 49 13 d1" in 64-bit mode becomes x86-64 "lea r9,[r10-1234]", which is "4d 8d 8a 2e fb ff ff": 4 bytes vs. 7.

> I can't find it at the moment but one of the studies I remember comparing the binary sizes was using GCC, which is widely available and free, but probably one of the worst compilers at x86 size optimisation I've seen.

LLVM is doing pretty well at x86-64 size optimization: for example, it prefers to select lower registers to reduce size. As I recall, Dan Gohman told me the code size win was something on the order of 2%. It really doesn't make a big difference: AArch64 and x86-64 have about the same code size.

> you can write a Fibonacci calculator for the latter in 5 bytes

But real code, again, hits the high registers.

> pushes and pops are single-byte instructions

Pushes and pops aren't used by most compilers, except in function prologs and epilogs. This is actually an example of inefficiency in the design of x86-64. The opcode space shouldn't go to functions that are only used to set up and tear down functions.

> on the former even a register-register move is 4 bytes

"mov r11,r12" is 3 bytes on x86-64. Not a big difference…

I even recall a remark in that study about how it was generating mostly RISC-like instructions,

There is a very good reason for GCC's x86 backend to do this. Intel/AMD optimisation manuals provide a subset of x86 instructions that are worth using. Instructions that are actually fast in modern designs, that don't fall back to legacy microcode.

This subset looks very RISC-like.

Sure, x86 has CISC instructions that sometimes allow very dense code, but if you want your code to actually run fast you need to do it the RISC way.