by sharpneli
2021 days ago
It has to flush them, because if another core sees the result of the atomic op, it must also see everything that the other core wrote before the op. While it can indeed first see no writes and then suddenly all of them, it can never see just the atomic op without the previous writes. Without that requirement, the store buffers can be kept unflushed to, as an example, wait and see if a full cache line accumulates and only flush then. The comment is correct that an x86 with a heavily reordering backend will beat an ARM without one. However, an ARM with one handily beats an x86 with one. Case in point: M1.
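To make the visibility rule concrete, here is a rough C++ sketch of the usual message-passing pattern, assuming the "atomic op" is a release store that another thread reads with an acquire load. The names (`observe_after_flag`, `payload`, `ready`) are made up for illustration and not from any particular codebase.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: returns what the consumer saw in `payload`
// after it observed the atomic flag.
int observe_after_flag() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // the atomic op other cores watch
    int seen = -1;

    std::thread producer([&] {
        payload = 42;  // ordinary write, may linger in a store buffer
        // Release store: every write before it must become visible
        // to any thread that observes this store with an acquire load.
        ready.store(true, std::memory_order_release);
    });

    std::thread consumer([&] {
        // Acquire load: spin until the flag is observed.
        while (!ready.load(std::memory_order_acquire)) {
        }
        // The flag was seen, so the earlier write must be seen too.
        seen = payload;
    });

    producer.join();
    consumer.join();
    assert(seen == 42);  // seeing the flag without the payload is not allowed
    return seen;
}
```

The consumer may spin for a while seeing neither write, but once it sees `ready == true` it cannot read a stale `payload`. As far as I understand it, on x86 plain stores already behave this way because of TSO, while on ARM the compiler has to emit a store-release (or a barrier) to get the same guarantee, and that is what leaves ARM free to reorder and batch ordinary stores.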
|
Is the 20% perf hit of TSO mode that you cite an ARM vs. ARM comparison? If so, that would be pretty damning.
Is there an easy way to flip the M1 into TSO mode for benchmarking? I would love to observe this 20% for myself.