by NobodyNada 685 days ago
One more: there's more to an ISA than just the instructions; there are semantic differences as well. x86 dates to a time before out-of-order execution, caches, and multi-core systems, so it has an extremely strict memory model that does not reflect modern hardware -- the only memory-reordering optimization permitted by the ISA is store buffering.
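For anyone unfamiliar with what "store buffering" permits, here is a rough sketch of the classic two-thread litmus test in Rust. The function name `both_zero_count` and the `use_fence` flag are made up for illustration and are not from any library. Each thread stores to one variable and then loads the other; because a store can sit in the store buffer while the following load executes, both threads may read 0. With a sequentially consistent fence between the store and the load, that outcome is forbidden by the language memory model.

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};
use std::sync::Barrier;
use std::thread;

/// Runs the store-buffering litmus test `iters` times and returns how many
/// runs ended with both threads reading 0.
/// `use_fence` (hypothetical flag) inserts a SeqCst fence between each
/// thread's store and its load.
fn both_zero_count(iters: usize, use_fence: bool) -> usize {
    let mut both_zero = 0;
    for _ in 0..iters {
        let x = AtomicUsize::new(0);
        let y = AtomicUsize::new(0);
        // Barrier so both threads start their store at roughly the same time.
        let start = Barrier::new(2);

        let (r1, r2) = thread::scope(|s| {
            let t1 = s.spawn(|| {
                start.wait();
                x.store(1, Ordering::Relaxed);
                if use_fence {
                    fence(Ordering::SeqCst);
                }
                y.load(Ordering::Relaxed)
            });
            let t2 = s.spawn(|| {
                start.wait();
                y.store(1, Ordering::Relaxed);
                if use_fence {
                    fence(Ordering::SeqCst);
                }
                x.load(Ordering::Relaxed)
            });
            (t1.join().unwrap(), t2.join().unwrap())
        });

        if r1 == 0 && r2 == 0 {
            both_zero += 1;
        }
    }
    both_zero
}
```

Calling `both_zero_count(n, true)` should always return 0. Without the fence the both-zero outcome is allowed on x86 as well as ARMv8, though how often it actually shows up depends on the machine and on timing.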

Modern x86 processors will actually perform speculative weak memory accesses in order to try to work around this memory model, flushing the pipeline if it turns out a memory-ordering guarantee was violated in a way that became visible to another core -- but this has complexity and performance impacts, especially when applications make heavy use of atomic operations and/or communication between threads.

Simple atomic operations can be an order of magnitude faster on ARMv8 vs x86: https://web.archive.org/web/20220129144454/https://twitter.c...

2 comments

"the only memory-reordering optimization permitted by the ISA is store buffering."

I think this is a mischaracterization of TSO. TSO only dictates the order in which stores become visible to other entities in the system; the individual cores are fully capable of using the results of stores that are not yet visible for their own OoO purposes, as long as the dataflow dependencies are correctly resolved. The complexity of read/write bypassing is simply there to preserve correct program order.

And this is why the TSO/non-TSO mode on something like the Apple cores doesn't seem to make a huge difference, particularly if one assumes that the core is aggressively optimized for the ARM memory model, and the TSO buffering/ordering is not a critical optimization point.

Put another way, a core designed to track store ordering utilizing some kind of writeback merging is going to be fully capable of executing just as aggressively OoO and holding back or buffering the visibility of completed stores until earlier stores complete. In fact for multithreaded lock-free code the lack of explicit write fencing is likely a performance gain for very carefully optimized code in most cases. A core which can pipeline and execute multiple outstanding store fences is going to look very similar to one that implements TSO.
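To make the "explicit write fencing" point concrete, here is a minimal message-passing sketch in Rust. The function name `publish_and_consume` is hypothetical. The producer writes a payload and then sets a flag with release ordering; the consumer spins on the flag with acquire ordering and then reads the payload. On a TSO machine the release store is typically just an ordinary store, since the hardware already keeps stores in order, whereas on a weakly ordered ARMv8 core the compiler typically has to emit a store-release instruction or a barrier to get the same guarantee.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

/// Publishes `value` from one thread and returns what a second thread
/// observes once it has seen the `ready` flag.
fn publish_and_consume(value: u64) -> u64 {
    let data = AtomicU64::new(0);
    let ready = AtomicBool::new(false);

    thread::scope(|s| {
        // Producer: write the payload, then publish it.
        s.spawn(|| {
            data.store(value, Ordering::Relaxed);
            // Release: the payload store above must be visible before the flag.
            ready.store(true, Ordering::Release);
        });

        // Consumer: wait for the flag, then read the payload.
        let consumer = s.spawn(|| {
            // Acquire: pairs with the release store on `ready`.
            while !ready.load(Ordering::Acquire) {
                std::hint::spin_loop();
            }
            data.load(Ordering::Relaxed)
        });

        consumer.join().unwrap()
    })
}
```

The release/acquire pair guarantees the consumer sees the published payload, so `publish_and_consume(42)` returns 42 regardless of the underlying hardware memory model.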

Yes, and Apple added this memory model to their ARM implementation so Rosetta 2 would work well.