by sharpneli
2022 days ago
Another major difference is the memory model. On x86, other CPUs must always see one core's writes in exactly program order. This significantly limits the ability to reorder store ops. ARM requires a memory barrier for this.

This is a major reason why x86 emulation is so slow: one must basically issue a memory barrier after every store op. M1 actually also implements the x86 memory model in hardware. It's only usable for Rosetta applications and comes with perhaps a 20% perf penalty, but it's still way better than emulating it with barriers.

In C++ terms it pretty much means x86 is always seq_cst. With ARM one can actually benefit from the different memory model options. As an example, one can do an atomic access without having to flush the whole store buffer out, which is impossible on x86.

Due to the instruction encoding and the memory model for multicore, I don't really see x86 dominating anymore in the upcoming decades. And as modern OoO cores are so similar internally, it's not even a big deal in the end. AMD shouldn't have any issues producing a Zen ARM core. Switch the instruction decoder and that's pretty much it (a ton of design work, for sure). Keep the x86 memory model optional for emulation, and binary translation can almost be thought of as just converting x86 instructions into a fixed-width encoding ahead of time.
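To make the "memory model options" point concrete, here is a minimal C++ sketch of message passing with release/acquire ordering. The function name `run_message_passing` is hypothetical and only there for illustration. As far as I know, mainstream compilers turn the release store and acquire load into plain `mov` instructions on x86 (TSO already provides that ordering for ordinary loads and stores), while on AArch64 they typically become `stlr` and `ldar`, and a `memory_order_relaxed` access there would be an ordinary `str`/`ldr`. Strictly speaking, that makes plain x86 accesses acquire/release rather than full seq_cst.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper for illustration: publishes one value from a producer
// thread to a consumer thread and returns what the consumer observed.
int run_message_passing() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag that publishes the payload
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        // release: the payload write above cannot be reordered after this store.
        ready.store(true, std::memory_order_release);
    });

    std::thread consumer([&] {
        // acquire: once this load returns true, the payload write is visible.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;
    });

    producer.join();
    consumer.join();
    assert(seen == 42);  // guaranteed by the release/acquire pairing
    return seen;
}
```

Calling `run_message_passing()` returns 42 on any conforming implementation. The difference between the architectures is only in which instructions the two atomic operations cost.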
I had always assumed that the looser memory model must have a performance benefit. But this comment from last week argues that it doesn't really buy that much, and that a bigger buffer can eliminate most of the difference: https://news.ycombinator.com/item?id=25263461
If TSO forces flushing of store buffers for every atomic access, that seems like a substantial disadvantage for x86.
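For what it's worth, the case where the store buffer actually matters is the classic store-buffering litmus test, sketched below in C++ (the name `run_store_buffering` is made up for this example). With `memory_order_seq_cst` the outcome where both threads read 0 is forbidden. To my knowledge, compilers enforce that on x86 by emitting the seq_cst store as `xchg` (or `mov` followed by `mfence`), which waits for the store buffer to drain, whereas relaxed, acquire, and release accesses stay plain `mov`. So under that assumption the drain applies to seq_cst stores and locked read-modify-write ops, not to every atomic access.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical helper for illustration: runs the store-buffering litmus test
// once and returns the values the two threads loaded.
std::pair<int, int> run_store_buffering() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r1 = -1;
    int r2 = -1;

    std::thread a([&] {
        x.store(1, std::memory_order_seq_cst);  // must be ordered before the load below
        r1 = y.load(std::memory_order_seq_cst);
    });

    std::thread b([&] {
        y.store(1, std::memory_order_seq_cst);  // must be ordered before the load below
        r2 = x.load(std::memory_order_seq_cst);
    });

    a.join();
    b.join();

    // seq_cst imposes a single total order on all four operations, so at
    // least one thread must observe the other's store.
    assert(r1 == 1 || r2 == 1);
    return {r1, r2};
}
```

If both stores and loads were weakened to `memory_order_relaxed` (or release/acquire), the result `{0, 0}` would become legal, and it can be observed on x86 as well as ARM because each core's store may still be sitting in its store buffer when the other core loads.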