by api 1944 days ago
"Performance is agnostic of ISA" is too strong a statement. The variable length instruction encoding is a significant performance disadvantage, as is the strict memory ordering requirement of X86/X64.

X64 decoders are indeed only ~5% of the die on a modern CPU, but it's 5% that is always at 100% utilization. That's a non-trivial amount of extra power. X64 decode parallelism is also limited. I've heard four instructions at once as a magic number beyond which it becomes really hard. This is why hyperthreading (SMT) is so common on X64 chips. It's a "cheat" to keep the pipeline full by decoding two different streams in parallel (allowing 8X parallelism). SMT isn't free though. It drags in a lot of complexity at the register file, pipeline, and scheduler levels, and is a bit of a security minefield due to spectre-style attacks. All that complexity adds more overhead and therefore more power consumption as well as taking up die space that could be used for more cores, wider cores, more cache, etc.

ARM is just a lot easier to optimize and crank up performance than X86. The M1 apparently has 8X wide instruction decode, and with fixed length instructions it would be trivial to take it to 16X or 32X if there was benefit to that. I could definitely imagine something like a 16X wide ARM64 core at 3nm capable of achieving up to 16X instruction level parallelism as well as supporting really wide vector operations at really high throughput. Put like 16 of those on a die and we're really far beyond X64 performance in every category.
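To make the decode argument concrete, here is a toy sketch of why finding instruction boundaries is easy for a fixed-width encoding and inherently serial for a variable-width one. The variable-length format below is an invented one for illustration (the low 4 bits of the first byte give the instruction length); it is not x86, and real x86 front ends use more elaborate tricks than this loop suggests.

```python
def fixed_boundaries(code, width=4):
    """Start offsets for a fixed-width ISA (e.g. 4-byte instructions).

    Every offset is known up front, so N decoders can each start at
    i * width without waiting on any other decoder.
    """
    return list(range(0, len(code), width))


def variable_boundaries(code):
    """Start offsets for a toy variable-length ISA.

    Assumption (hypothetical encoding, not x86): the low 4 bits of the
    first byte of each instruction hold its total length in bytes.
    """
    offsets = []
    pos = 0
    while pos < len(code):
        length = code[pos] & 0x0F
        if length == 0:
            raise ValueError(f"zero-length instruction at offset {pos}")
        offsets.append(pos)
        # The next start depends on the length just decoded: a serial chain.
        pos += length
    return offsets


if __name__ == "__main__":
    print(fixed_boundaries(bytes(16)))
    print(variable_boundaries(bytes([0x02, 0xAA, 0x01, 0x03, 0xBB, 0xCC, 0x01])))
```

In the fixed-width case the list of offsets does not depend on the bytes at all, which is what makes widening the decoder mostly a matter of adding more copies. In the variable-width case each offset depends on the previous instruction's length, so going wider means either guessing boundaries or doing extra work to find them.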

This is also why SMT/hyperthreading doesn't really exist in the ARM world. There's less to be gained from it. Better to have a simpler core and more of them.

IMHO X86/X64 has hit a performance wall at least in terms of power/performance, and this time it might be insurmountable due to variable length instructions and associated overhead. It matters in the data center as well as for mobile and laptops. There's a reason AWS is pricing to steer people toward Graviton: it costs less to run. Power is the largest component of most data center costs.


While it’s absolutely true that fixed width instructions make parallel decoding vastly easier, there’s a cost in terms of binary footprint size. x86 generally has an advantage in instruction cache and TLB performance for this reason, which can be significant depending on the workload.
Not true. This is a common myth that comes from some old Linus posts in the 32-bit Pentium 4 days and still won't die. I've done comparisons to test this. Compare sizes of modern x86-64 Linux binaries to their counterparts on AArch64. You'll find that they're extremely close.

The biggest problem is all the REX prefixes. The inefficient encoding of registers in x86-64 squandered all the advantages that x86 had.
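As a rough illustration of the REX cost, here is a minimal sketch that encodes only register-to-register `ADD` (opcode `0x01 /r`) and shows when the extra prefix byte appears. The helper name `encode_add` and the register tables are made up for this example; it covers one instruction form and nothing else.

```python
# Register numbers as used in ModRM; r8..r15 need an extension bit in REX.
_BASE = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
REG64 = {"r" + n: i for i, n in enumerate(_BASE)}
REG64.update({f"r{i}": i for i in range(8, 16)})
REG32 = {"e" + n: i for i, n in enumerate(_BASE)}
REG32.update({f"r{i}d": i for i in range(8, 16)})


def encode_add(dst, src):
    """Encode `ADD dst, src` for two registers of the same width.

    Hypothetical helper for illustration; handles only opcode 0x01 /r.
    """
    if dst in REG64 and src in REG64:
        w, d, s = 1, REG64[dst], REG64[src]
    elif dst in REG32 and src in REG32:
        w, d, s = 0, REG32[dst], REG32[src]
    else:
        raise ValueError("expected two 32-bit or two 64-bit registers")

    r = s >> 3  # REX.R extends the ModRM.reg field (source here)
    b = d >> 3  # REX.B extends the ModRM.rm field (destination here)
    modrm = 0xC0 | ((s & 7) << 3) | (d & 7)  # mod=11: register direct

    out = bytearray()
    if w or r or b:
        # REX prefix layout: 0100 W R X B (X is unused without a SIB byte)
        out.append(0x40 | (w << 3) | (r << 2) | b)
    out += bytes([0x01, modrm])
    return bytes(out)


if __name__ == "__main__":
    for dst, src in [("eax", "ebx"), ("rax", "rbx"), ("r8", "r9")]:
        enc = encode_add(dst, src)
        print(f"add {dst}, {src}: {enc.hex()} ({len(enc)} bytes)")
```

The 32-bit form on the legacy registers fits in 2 bytes, but any 64-bit operand size or any use of r8 through r15 adds a third byte. An AArch64 `ADD` between registers is always 4 bytes, so the gap in this simple case is one byte rather than two.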

Is true. They said:

> > x86 generally has an advantage [emphasis added, not "x86-64"]

Obviously if you take the worst of both worlds (bloated and variable-width instructions), you can squander that advantage, but the advantage is in fact real.

Is this still really relevant? I can understand that it was a problem 20 years ago, but with current processors having huge L1 caches and memory bandwidth, I am starting to think that 4 bytes (or variable 4/8 bytes) is not a bad tradeoff for density vs. superscalar.
L1 size in 1999: 32 kB

L1 size in 2021: 64 kB

The L1 size is yet another place where the x86 legacy hinders things. To avoid aliasing in a virtually indexed L1 cache (which is what you want for performance in a L1 cache, since a physically indexed cache would have to wait for the TLB lookup), the size of each way is limited to the page size, which on x86 is 4096 bytes. To get a 64 KiB L1 cache, it would have to be a 16-way cache, and increasing that too much makes the cache slower and more power-hungry. It's no wonder Apple decided to use a 16 KiB page size instead of a 4 KiB page size; a 64 KiB VIPT L1 cache with 16 KiB page size needs only 4 ways.
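The arithmetic behind those way counts can be sketched as follows, assuming the simple rule stated above: each way of an alias-free VIPT cache can be no larger than one page. The function name is made up for this example, and real designs can sidestep the rule with other anti-aliasing mechanisms.

```python
def min_ways_vipt(cache_bytes, page_bytes):
    """Minimum associativity for an alias-free VIPT cache.

    Assumes the simple constraint that one way must fit within one page,
    so that all index bits come from the untranslated page offset.
    """
    if cache_bytes <= 0 or page_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: a partial page still needs a whole extra way.
    return -(-cache_bytes // page_bytes)


if __name__ == "__main__":
    KiB = 1024
    for cache, page in [(32 * KiB, 4 * KiB), (64 * KiB, 4 * KiB), (64 * KiB, 16 * KiB)]:
        ways = min_ways_vipt(cache, page)
        print(f"{cache // KiB} KiB cache, {page // KiB} KiB pages: {ways} ways")
```

With 4 KiB pages a 32 KiB cache already needs 8 ways and a 64 KiB cache needs 16, while 16 KiB pages bring the 64 KiB case down to 4.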

For the L1 instruction cache, aliasing shouldn't be a problem (since it's never written to), but this is once again another place where the x86 legacy hinders things: instead of requiring an explicit instruction to invalidate a virtual address in the instruction cache, it's implicitly invalidated when writing to that address.

Apple M1 big core cache sizes:

192KB L1I/128KB L1D

Little cores: 128KB L1I/64KB L1D

Wow. Didn't know that. That should more than compensate for a very slight increase in code size for ARM64 vs X64.

When I use M1, AWS Graviton, or even older Cavium ThunderX chips I can't help but think that X86 is on its way out. The advantage is something you can subjectively see and feel. It's obvious, especially when it comes to power consumption.

Process node has something to do with it, but it's not the whole story. I'm typing on a 10nm Ice Lake MacBook Air and while this chip is better than older 14nm Intel laptops it's still just shockingly crushed by the M1 on every metric. 10nm -> 5nm is not enough to explain that, especially since apparently Intel is more conservative with its numbering and Intel 10nm is more comparable to TSMC 7nm. So it's more like TSMC 7nm vs TSMC 5nm, which is not a large enough gap to account for what seems to be at least 1.5X better performance and 3X better power efficiency.

Some of the X86/X64 apologists remind me of old school aerospace companies dissing not only SpaceX and Blue Origin but the whole idea of reusable rockets, trying to convince us that there's little economic advantage in reusing a $100M rocket stage that consumes ~$100-200K in fuel per launch.

"That's not much of a meteorite. It's no big deal." - Dinosaurs