by sharpneli 2022 days ago
Another major difference is the memory model. On X86, other CPUs must always see a core's writes in exactly the order they were issued. This significantly limits the ability to reorder store ops. ARM requires a memory barrier for this. This is a major reason why X86 emulation is so slow: one must basically issue a memory barrier after every store op.

The M1 actually implements the X86 memory model in HW too. It’s only usable for Rosetta applications and comes with perhaps a 20% perf penalty. But it’s still way better than emulating it with barriers.

In C++ terms it pretty much means plain loads and stores on X86 are always at least acquire/release, and atomic read-modify-write ops are always seq_cst. With ARM one can actually benefit from the different memory order options. As an example, one can do an atomic access without having to flush the whole store buffer, which is impossible on X86.
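A minimal C++ sketch of the point about memory orders, with a hypothetical helper `count_with` that is not from any library. The result is the same for every ordering; what differs is the code the compiler may emit. On X86 an atomic `fetch_add` typically becomes a `lock`-prefixed instruction regardless of the requested order, while on ARM a relaxed one can typically be emitted without any ordering attached.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: several threads bump one shared counter using the
// memory order passed in. Atomicity holds for every order, so the final
// count is always threads * iters; only the ordering guarantees around
// each increment (and the instructions emitted for it) change.
std::uint64_t count_with(std::memory_order order, int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, order, iters] {
            for (int i = 0; i < iters; ++i) {
                counter.fetch_add(1, order);
            }
        });
    }
    for (auto& th : pool) {
        th.join();  // join() makes all increments visible to this thread
    }
    return counter.load(std::memory_order_relaxed);
}
```

Calling it once with `std::memory_order_relaxed` and once with `std::memory_order_seq_cst` and comparing the generated assembly on both architectures is one way to see the difference for yourself.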

Due to the instruction encoding and the memory model for multicore, I don’t really see X86 dominating anymore in the upcoming decades.

And as the modern OoO cores are so similar internally, it’s not even a big deal in the end. AMD shouldn’t have any issues producing a Zen ARM core. Switch the instruction decoder and that’s pretty much it (a ton of design work, for sure). Keep the X86 memory model as an optional mode for emulation, and binary translation can almost be thought of as just converting X86 instructions into fixed-width ones ahead of time.

3 comments

I am trying to wrap my head around whether ARM's looser memory model is a fundamental performance advantage or not.

I had always assumed that the looser memory model must have a performance benefit. But this comment from last week argues that it doesn't really buy that much, and that a bigger buffer can eliminate most of the difference: https://news.ycombinator.com/item?id=25263461

If TSO forces flushing of store buffers for every atomic access, that seems like a substantial disadvantage for x86.

It has to flush them, because if another core sees the result of the atomic op, it must also see everything else that the first core wrote before the op. It can indeed first see no writes and then suddenly all of them, but it can never see just the atomic op without the previous writes.
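The visibility rule described here is the classic message-passing pattern. A minimal C++ sketch, assuming the hypothetical names `payload`, `ready` and `message_passing_once`: the consumer may spin for a while seeing nothing, but once it sees the flag it must also see the payload. On X86 this ordering comes with every store; on ARM it has to be requested, here through release/acquire.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical litmus-style helper: one thread publishes a payload and
// then raises a flag; another waits for the flag and reads the payload.
int message_passing_once() {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};   // the "atomic op" other cores observe
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // earlier write
        ready.store(true, std::memory_order_release);  // publish
    });
    std::thread consumer([&] {
        // Seeing ready == true via an acquire load guarantees that the
        // earlier write to payload is visible as well.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;
}
```

With both orderings weakened to `std::memory_order_relaxed` the program would have a data race on `payload`, and on a weakly ordered machine the consumer could in principle observe the flag without the payload.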

Without that requirement the store buffers can be kept unflushed, for example to see whether a full cache line can be accumulated, and only flushed then.

The comment is correct that an X86 with a heavy reordering backend will beat an ARM without one. However, an ARM with one does handily beat an X86 with one. Case in point: M1.

Interesting, are you suggesting that a large part of the M1 performance advantage is thanks to the weaker ARM memory model?

Is the 20% perf hit of TSO mode that you cite an ARM vs. ARM comparison? If so, that would be pretty damning.

Is there an easy way to flip the M1 into TSO mode for benchmarking? I would love to observe this 20% for myself.

A large part, yes, but not the sole reason. It’s fast because of many things like that. TSO doesn’t affect single-core perf much, so it’s not really a factor there, and yet single-core is blazingly fast. The multicore perf is really great too, however.

I haven’t verified the exact numbers myself, and it will depend on the exact thing you’re running. It’s just on the order of low tens of percent.

TSO cannot be enabled outside of Rosetta, as it’s not exactly a good ARM extension. Perhaps you could do some trickery, but Apple likely prevents that.

However, you can test it by making something where you know Rosetta generates ARM assembly comparable to the X86 original, and run the comparison that way. Some sort of parallel lock-free algorithm would be the best candidate.
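One possible candidate for such a test, sketched in C++: a single-producer/single-consumer ring buffer that only uses acquire/release ordering. All names here (`SpscRing`, `pump`) are hypothetical, there is no timing code, and this is not a validated benchmark, just a starting point that could be built natively and for X86 and then compared.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

// Hypothetical SPSC ring buffer. head_ is written only by the producer,
// tail_ only by the consumer; release/acquire pairs order the slot
// accesses against the index updates.
template <std::size_t N>
class SpscRing {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");
public:
    bool try_push(std::uint64_t v) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;            // full
        slots_[head % N] = v;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }
    bool try_pop(std::uint64_t& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return false;                // empty
        out = slots_[tail % N];
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }
private:
    std::array<std::uint64_t, N> slots_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

// Pushes 0..count-1 through the ring on one thread and sums them on
// another; the returned sum doubles as a correctness check.
std::uint64_t pump(std::uint64_t count) {
    SpscRing<1024> ring;
    std::uint64_t sum = 0;
    std::thread producer([&] {
        for (std::uint64_t i = 0; i < count; ++i) {
            while (!ring.try_push(i)) std::this_thread::yield();
        }
    });
    std::thread consumer([&] {
        std::uint64_t v = 0;
        for (std::uint64_t n = 0; n < count; ++n) {
            while (!ring.try_pop(v)) std::this_thread::yield();
            sum += v;
        }
    });
    producer.join();
    consumer.join();
    return sum;
}
```

Wrapping a call to `pump` in a `std::chrono::steady_clock` measurement and running the native and translated builds side by side would give a rough number, subject to the caveat that the translated code may not match the native code instruction for instruction.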

TSO is possible to enable outside of Rosetta with some shenanigans in the kernel. Unfortunately, getting Rosetta to generate code that is comparable with what a compiler would create is quite difficult: it needs to lift x86 into its own IR and then redo register allocation, which it is quite good at but obviously not perfect.

I haven't done exact measurements, but I don't think the cost of enabling TSO is anywhere near as high as 20%. On the contrary, I don't think I have noticed a real difference; perhaps it is a couple of percent slower.

Just want to add one thing: x86 has stronger memory “semantics”. So it doesn’t have to work that way behind the scenes; it just has to appear, at the end of the block, that it worked that way. So x86 does a lot of reordering, store combining, etc. IMHO, the performance difference between ARM and x86 is barely related to the ISA; in the M1 case it’s definitely not, as there is a lot more going on than just taking advantage of a weaker memory model.

Having to appear to have worked that way does cause restrictions in the multiprocessor case. ARM chips naturally do all of that too, with the memory model simply giving them way more freedom to reorder things.

One couldn’t do an X86 version of the M1, mostly because there is no way of making an instruction decoder that wide for it.

And the performance penalty of the M1 when working in TSO mode strongly implies that yes, the weaker memory model indeed plays a major role. Not the biggest, but definitely not insignificant. Tens of percent here and tens of percent there combine into a ridiculous perf boost.