| HN Mirror

"the only memory-reordering optimization permitted by the ISA is store buffering."

I think this is a mischaracterization of TSO. TSO only dictates the store ordering to other entities in the system, the individual cores are fully capable of using the results of stores that are not yet visible for their own OoO purposes as long as the dataflow dependencies are correctly solved. The complexities of the read/write bypassing is simply to clarify correct program order.

And this is why the TSO/non TSO mode on something like the apple cores doesn't seem to make a huge difference, particularly if one assumes that the core is aggressively optimized for the arm memory model, and the TSO buffering/ordering is not a critical optimization point.

Put another way, a core designed to track store ordering utilizing some kind of writeback merging is going to be fully capable of executing just as aggressively OoO and holding back or buffering the visibility of completed stores until earlier stores complete. In fact for multithreaded lock-free code the lack of explicit write fencing is likely a performance gain for very carefully optimized code in most cases. A core which can pipeline and execute multiple outstanding store fences is going to look very similar to one that implements TSO.