by oblio 2022 days ago
Well, two main reasons:

1. Innovator's dilemma: https://en.wikipedia.org/wiki/The_Innovator%27s_Dilemma

ARM in this case is the underdog, attacking the incumbent x86. It is the lesser tech because it started from "below" (lower value niches not taken by the incumbent which prefers higher profit margins).

2. RISC vs CISC has never been settled. Until Apple (and, very recently, Amazon) produced their latest ARM-based designs, x86 was considered superior. Yes, RISC was theoretically better in the 80s and later the 90s, but x86 has been reinvented successfully 2 times (micro-ops and x86_64) and scaled from roughly 5 MHz to 5 GHz. It's also a mix of CISC and RISC, it's not pure CISC now. The same goes for ARM: they've added various instructions which bring it closer to CISC.

As far as I can see the only real difference these days is the constant length of instructions for ARM: https://erik-engheim.medium.com/why-is-apples-m1-chip-so-fas...

5 comments

Another major difference is the memory model. In X86, other CPUs must always see the writes of a core in program order. This significantly limits the ability to reorder store ops. ARM requires a memory barrier for this. This is a major reason why X86 emulation is so slow: one must basically issue a memory barrier after every store op.
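
A toy model can make the ordering point concrete. The sketch below is a heavy simplification and not how any real core works: one writer core puts two stores (data, then flag) into a store buffer, a reader core loads flag and then data in program order, and the only difference between the two modes is whether the buffer drains in FIFO order (TSO-like) or in any order (weak, no barrier). The names STORES, LOADS and outcomes are made up for this illustration.

```python
from itertools import combinations, permutations

STORES = [("data", 1), ("flag", 1)]  # writer's program order
LOADS = ["flag", "data"]             # reader's program order


def outcomes(fifo_drain):
    """Return every (flag, data) pair the reader can observe in this toy model."""
    # FIFO drain keeps the writer's order; otherwise any drain order is allowed.
    drain_orders = [tuple(STORES)] if fifo_drain else list(permutations(STORES))
    seen = set()
    for drains in drain_orders:
        # Pick which 2 of the 4 time slots are store-buffer drains;
        # the remaining 2 slots are the reader's loads.
        for drain_slots in combinations(range(4), 2):
            memory = {"data": 0, "flag": 0}
            pending_drains = list(drains)
            pending_loads = list(LOADS)
            observed = {}
            for slot in range(4):
                if slot in drain_slots:
                    name, value = pending_drains.pop(0)
                    memory[name] = value
                else:
                    name = pending_loads.pop(0)
                    observed[name] = memory[name]
            seen.add((observed["flag"], observed["data"]))
    return seen


tso = outcomes(fifo_drain=True)
weak = outcomes(fifo_drain=False)
print((1, 0) in tso)   # False
print((1, 0) in weak)  # True
```

In this model only the weak mode can show flag == 1 together with data == 0, which is the case a barrier between the two stores is there to rule out.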

The M1 actually implements the X86 memory model in HW too. It’s only usable for Rosetta applications and comes with perhaps a 20% perf penalty. But it’s still way better than emulating it with barriers.

In C++ terms it pretty much means that on X86 every plain load and store is already at least acquire/release, and every atomic read-modify-write is effectively seq_cst. With ARM one can actually benefit from the different memory order options. As an example, one can do an atomic read-modify-write without having to flush the whole store buffer, which is impossible on X86.

Due to the instruction encoding and the memory model for multicore, I don’t really see X86 dominating anymore in the upcoming decades.

And as the modern OoO cores are so similar internally, it’s not even a big deal in the end. AMD shouldn’t have any issues with producing a Zen ARM core. Switch the instruction decoder and that’s pretty much it (a ton of design work, for sure). Keep the X86 memory model optional for emulation, and binary translation can almost be thought of as just turning X86 instructions into fixed-width ones ahead of time.

I am trying to wrap my head around whether ARM's looser memory model is a fundamental performance advantage or not.

I had always assumed that the looser memory model must have a performance benefit. But this comment from last week argues that it doesn't really buy that much, and that a bigger buffer can eliminate most of the difference: https://news.ycombinator.com/item?id=25263461

If TSO forces flushing of store buffers for every atomic access, that seems like a substantial disadvantage for x86.

It has to flush them. Because if another core sees the result of the atomic op, it must also see everything else that the other core wrote before the op. While it can indeed first see no writes and then suddenly all of them, it can never see just the atomic op without the previous writes.

Without that the store buffers can be kept unflushed to, as an example, see if one can get a full cacheline or whatnot and only flush then.
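
For reference, a store buffer under TSO as it is usually described still lets a core's later load (from a different address) complete before its own earlier store has drained. The classic store-buffering litmus test shows this, and also shows what draining the buffer buys. The Python below is a toy enumeration under that assumption; sb_outcomes and the event names are hypothetical, and real hardware is far more involved.

```python
from itertools import permutations


def sb_outcomes(fence):
    """Store-buffering litmus test in a toy model.

    Core 0 runs: x = 1; r0 = y
    Core 1 runs: y = 1; r1 = x
    Returns every (r0, r1) pair that can be observed.
    """
    events = [("drain", 0), ("load", 0), ("drain", 1), ("load", 1)]
    results = set()
    for order in permutations(events):
        # With a fence, each core's store must reach memory before its load.
        if fence and any(
            order.index(("drain", core)) > order.index(("load", core))
            for core in (0, 1)
        ):
            continue
        memory = [0, 0]        # memory[0] is x, memory[1] is y
        regs = [None, None]    # regs[0] is r0, regs[1] is r1
        for kind, core in order:
            if kind == "drain":
                memory[core] = 1                # own store becomes visible
            else:
                regs[core] = memory[1 - core]   # load the other core's variable
        results.add(tuple(regs))
    return results


print((0, 0) in sb_outcomes(fence=False))  # True
print((0, 0) in sb_outcomes(fence=True))   # False
```

Without the fence both cores can read 0; forcing the drain before the load removes that outcome, at the price of waiting for the store buffer.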

The comment is correct that an X86 with a heavy reordering backend will beat an ARM without one. However, an ARM with one does handily beat an X86 with one. Case in point: M1.

Interesting, are you suggesting that a large part of the M1 performance advantage is thanks to the weaker ARM memory model?

Is the 20% perf hit of TSO mode that you cite an ARM vs. ARM comparison? If so, that would be pretty damning.

Is there an easy way to flip the M1 into TSO mode for benchmarking? I would love to observe this 20% for myself.

A large part, yes. But not the only reason. It’s fast because of many things like that. TSO doesn’t affect single-core perf much, so it’s not really a factor there, and yet it’s blazingly fast. However, the multicore perf is really great too.

I haven’t verified the exact numbers myself. And it will depend on the exact thing you’re running. It’s just on the order of low tens of percent.

TSO cannot be enabled outside of Rosetta as it’s not exactly a good ARM extension. Perhaps you could do some trickery, but Apple likely prevents that.

However, you can test it by making something where you know Rosetta generates comparable ARM assembly from the X86 one and just run the comparison that way. Some sort of parallel lock-free algorithm would be the best candidate.

TSO is possible to enable outside of Rosetta with some shenanigans in the kernel. Unfortunately getting Rosetta to generate code that is comparable with what a compiler would create is quite difficult: it needs to lift x86 into its own IR and then re-do register allocation, which it is quite good at but obviously not perfect.

I haven't done exact measurements, but I don't think the cost of enabling TSO is anywhere near as high as 20%. On the contrary, I don't think I have noticed a real difference; perhaps it is but a couple percent slower.

Just want to add one thing: x86 has stronger memory “semantics”. It doesn’t have to work that way behind the scenes; it just has to appear, at the end of the block, that it worked that way. So x86 does a lot of reordering, store combining, etc. IMHO, the performance difference between ARM and x86 is barely related to the ISA. In the M1 case it’s definitely not; there is a lot more going on than just taking advantage of a weaker memory model.

Having to appear to have worked that way does impose restrictions in the multiprocessor case. ARM chips naturally do all of that too, with the memory model simply giving them way more freedom to reorder things.

One couldn’t do an X86 version of the M1, mostly because there is no way of making an instruction decoder that wide for it.

And the performance penalty of the M1 when working in TSO mode strongly implies that yes, the weaker memory model indeed plays a major role. Not the biggest, but definitely not insignificant. Tens of percent here and tens of percent there combine into a ridiculous perf boost.

I think "RISC" was, for a stretch of time, "superior", but it was always in very expensive and very niche workstations and servers.

x86 was itself a kind of underdog in that space, losing out in sheer performance and bandwidth to the high-end RISC market, but making up for it by, you know, being cheap enough that "regular people" could go and buy a family computer with an x86 processor.

Intel managed to slowly improve the various deficiencies over time, and produced a processor not only affordable, but also superior to the incumbent RISC chips in all but niche areas.

I think more than the ISA, Apple's M1 manages to be so good at its job for similar reasons to the old RISC chips -- it's built from the ground-up with a pretty specific target application, rather unlike x86, which has to be all things to all people, with legacy support and scaling from laptops to quad socket 3U servers.

The reinvention that gave x86 the advantage over (workstation) RISC was OoO execution in the Pentium.

It’s a bit silly to think of x86 and ARM as being so different these days. Most x86 code looks a lot like it was produced for a RISC chip and ARM has been gaining some more complex instructions and addressing modes. Academics may have felt that RISC was better than CISC for a long time but I don’t think they were predicting the world of today so much as they were incorrectly predicting the near past. If RISC were so much better then we’d have lots of workstations running on modern Alpha or PA-RISC systems. But we don’t see that.

Ah, yeah, forgot to mention Out of Order Execution.

> Academics may have felt that RISC was better than CISC for a long time but I don’t think they were predicting the world of today so much as they were incorrectly predicting the near past.

Plus you know, academics have been known to be wrong. That's why there's even a saying for it: science advances, one funeral at a time. People are emotional and get attached to their pet theories. There are many examples of this.

It's going to be interesting to see how CPU/GPU tech advances the next few years.

> Most x86 code looks a lot like it was produced for a RISC chip

Well yes, I don't think the x86 BCD instructions are still being used liberally, or AAA, or CMPS/SCAS, etc.

It was the PPro, and OoO processors were being developed by lots of companies and teams at that time. Wikipedia indicates it was featured in the PowerPC 601 (93), SPARC64 (95), PPro (95), MIPS R10000 (96), etc. Hard to see where the advantage over workstations was on that point...

But ARM64 instructions are all fixed width still, right?

And how specifically are these instructions complex? Do they produce an unusually high number of micro-ops when decoded?

Yeah, the CISC vs RISC thing barely makes sense anymore. It could only matter in the context of hand-programmed, low-frequency processors. What remains of "it" (at least in some people's minds) that is still relevant today on major ISAs is clearly instruction encoding, but you could totally make a CISC with fixed-length instructions. You could also make a RISC with (highly) variable length but... just why? And actually, why make a highly variable-length ISA at all, regardless of whether the ISA is RISC or CISC?

The real reason that x86 has a highly variable length is only historical. When you only decoded one instruction at a time it did not matter that much. So maybe you wanted some big ones for convenience, but it would be a shame to make them all big. So the 8086 had instructions of 1 to 6 bytes (up to 10 with prefix?). And then you had 32-bit with 16-bit compat, where it seems it went up to 15 bytes, and then stayed at 15 max for AMD64...
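
The point about decoding one instruction at a time can be sketched in a few lines of Python. This is a toy: the variable-length encoding here is invented (the first byte of each instruction holds its total length) and is not real x86, and both function names are made up for illustration. It only shows that with a fixed width every instruction start is known up front, while with variable length each start depends on the previous instruction.

```python
def fixed_width_starts(code, width=4):
    """Every start offset is known immediately: 0, width, 2 * width, ..."""
    return list(range(0, len(code), width))


def variable_width_starts(code):
    """Each start offset is only known after reading the previous length."""
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        length = code[offset]  # hypothetical encoding: first byte is the length
        if length == 0:
            raise ValueError("zero-length instruction in toy encoding")
        offset += length
    return starts


fixed_code = bytes(16)  # four 4-byte instructions
variable_code = bytes([1, 3, 0, 0, 2, 0, 6, 0, 0, 0, 0, 0])  # lengths 1, 3, 2, 6

print(fixed_width_starts(fixed_code))        # [0, 4, 8, 12]
print(variable_width_starts(variable_code))  # [0, 1, 4, 6]
```

A wide decoder wants all the start offsets at once. In the second function each offset is only available after the previous length has been read, and that serial dependency is what a variable-length decoder has to work around.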

Today the most used x86 instructions are not that much different from ARM ones, and in a good number of cases are actually even simpler. The simplest way to compare is to look at the assembly an optimizing native compiler emits, and look up what the emitted instructions are doing.

The micro-ops of a processor are actually quite dependent on the ISA. You could do neutral micro-ops, but I doubt this would be very efficient. So you can't really compare the complexity of ISAs by the number of micro-ops issued, because across different archs and microarchs the micro-ops are themselves more complex on some points and less on others, with tons of similarities with (the core instruction set of) the ISAs they implement.

Complex instructions (for backward compat, special, or intrinsically difficult operations) are transformed into a potentially very high number of micro-ops, often looked up in a ROM.

What aspect added to x86 is RISC like? It sounds a bit like an oxymoron. If you start with a complex instruction set, you cannot make it less complex by adding new instructions.

Generally, this is (poorly) used to refer to x86 decoding things to micro-ops.

> It's also a mix of CISC and RISC, it's not pure CISC now.

It's not. It's microcoded. Like VAX. Which RISC is a response to.

CISC and RISC are mostly an ISA concept.

Most chips are "microcoded", at least here and there. The term does not define a whole micro-architecture, esp in modern complex ones, but I'm not really sure how you would implement old school CISC chips without microcode concepts. The 8086 was microcoded: https://www.reenigne.org/blog/8086-microcode-disassembled/. It basically means that "you" program some low level internal details of a chip, e.g. the muxes connecting various buses to execution units. Most of the time the "you" is only the chip designers, and it is stored mostly in a ROM. Sometimes microcoding is accessible by actual software programmers, but it is quite rare.

And modern u-ops are not necessarily like what was done at the time of old-school microcode.

Micro ops?

Those are two completely different things. Every x86 instruction is decomposed into one or several micro-ops that are executed. Hardly any x86 instructions are microcoded, where "microcoded" means that the decoder has to start streaming micro-ops from the microcode ROM.
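
A rough way to picture the distinction, as a Python toy. Everything in the tables is hypothetical: the micro-op names, the counts, and which instructions land in which table are chosen for illustration and do not describe any real x86 core.

```python
# Hypothetical: instructions the decoder turns into micro-ops directly.
HARDWIRED = {
    "add reg, reg": ["alu_add"],
    "add reg, [mem]": ["load", "alu_add"],
    "add [mem], reg": ["load", "alu_add", "store"],
}

# Hypothetical: rare, complex instructions whose micro-ops come from a ROM.
MICROCODE_ROM = {
    "rep movsb": ["check_count", "load", "store", "update_pointers", "loop_back"],
}


def decode(instruction):
    """Return (source, micro_ops) for one instruction in this toy model."""
    if instruction in HARDWIRED:
        return "decoder", HARDWIRED[instruction]
    if instruction in MICROCODE_ROM:
        return "microcode ROM", MICROCODE_ROM[instruction]
    raise KeyError(f"unknown instruction: {instruction}")


print(decode("add [mem], reg"))  # ('decoder', ['load', 'alu_add', 'store'])
print(decode("rep movsb")[0])    # microcode ROM
```

Both paths end in micro-ops; the difference in this sketch is only where they come from.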