by sharpneli 2021 days ago

It has to flush them, because if another core sees the result of the atomic op, it must also see everything else that the first core wrote before the op. The other core may well see none of the writes at first and then suddenly all of them, but it can never see just the atomic op without the writes that preceded it.

Without that requirement the store buffers can be left unflushed, for example to wait and see whether a full cache line accumulates, and only be flushed then.
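To make the guarantee above concrete, here is a minimal C++ sketch of the usual message-passing pattern. The names `payload`, `ready` and `run_message_passing` are made up for illustration. As far as I know, on x86 the release store and acquire load compile to plain `mov`s because TSO already orders them, while on AArch64 compilers typically emit `stlr` and an acquiring load such as `ldar`, so only those accesses pay for the ordering.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example: one thread publishes data, another consumes it.
int run_message_passing() {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // the atomic op the other core observes
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // plain write before the op
        ready.store(true, std::memory_order_release);  // publishes all earlier writes
    });
    std::thread consumer([&] {
        // Spin until the flag becomes visible to this thread.
        while (!ready.load(std::memory_order_acquire)) { }
        seen = payload;  // once the flag is seen, the earlier write is visible too
        assert(seen == 42);
    });

    producer.join();
    consumer.join();
    return seen;
}
```

If both accesses to `ready` were `memory_order_relaxed`, a weakly ordered core would be allowed to make the flag visible before `payload`, which is exactly the case the flush rules out.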

The comment is correct that an x86 with a heavily reordering backend will beat an ARM without one. However, an ARM with one handily beats an x86 with one. Case in point: the M1.

haberman 2021 days ago

Interesting, are you suggesting that a large part of the M1 performance advantage is thanks to the weaker ARM memory model?

Is the 20% perf hit of TSO mode that you cite an ARM vs. ARM comparison? If so, that would be pretty damning.

Is there an easy way to flip the M1 into TSO mode for benchmarking? I would love to observe this 20% for myself.

sharpneli 2021 days ago

A large part, yes, but not the only reason. It’s fast because of many things like that. TSO doesn’t affect single-core perf much, so it’s not really a factor there, and yet the single-core perf is blazingly fast. The multicore perf is really great too, though.

I haven’t verified the exact numbers myself, and they will depend on exactly what you’re running. The hit is on the order of low tens of percent.

TSO cannot be enabled outside of Rosetta, as it’s not exactly a good ARM extension. Perhaps you could do some trickery, but Apple likely prevents that.

However, you can test it by writing something where you know Rosetta generates ARM assembly comparable to the x86 original, and running the comparison that way. Some sort of parallel lock-free algorithm would be the best candidate.
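As a rough idea of what such a candidate could look like, here is a sketch of a single-producer/single-consumer lock-free ring with a timing wrapper. Everything in it (`SpscRing`, `run_spsc_bench`, the ring size of 1024) is a made-up illustration, not a validated benchmark. The idea would be to build the same source once for x86-64 (run under Rosetta) and once natively, then compare the timings, assuming the generated code really is comparable.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>

// Hypothetical SPSC ring buffer; N must be a power of two.
template <std::size_t N>
struct SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    std::uint64_t slots[N];
    std::atomic<std::size_t> head{0};  // next index to read (written by consumer)
    std::atomic<std::size_t> tail{0};  // next index to write (written by producer)

    bool push(std::uint64_t v) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false;  // full
        slots[t & (N - 1)] = v;
        tail.store(t + 1, std::memory_order_release);  // publish the slot write
        return true;
    }

    bool pop(std::uint64_t& v) {
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;  // empty
        v = slots[h & (N - 1)];
        head.store(h + 1, std::memory_order_release);  // hand the slot back
        return true;
    }
};

struct BenchResult {
    std::uint64_t sum;  // checksum of everything the consumer received
    double seconds;     // wall-clock time for the whole transfer
};

// Sends the values 1..count through the ring and times the transfer.
BenchResult run_spsc_bench(std::uint64_t count) {
    SpscRing<1024> ring;
    std::uint64_t sum = 0;
    auto start = std::chrono::steady_clock::now();

    std::thread producer([&] {
        for (std::uint64_t i = 1; i <= count; ++i) {
            while (!ring.push(i)) { }  // spin while the ring is full
        }
    });
    std::thread consumer([&] {
        std::uint64_t v = 0;
        for (std::uint64_t i = 0; i < count; ++i) {
            while (!ring.pop(v)) { }  // spin while the ring is empty
            sum += v;
        }
    });

    producer.join();
    consumer.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    assert(sum == count * (count + 1) / 2);  // sanity check: nothing lost or duplicated
    return {sum, elapsed.count()};
}
```

Calling something like `run_spsc_bench(100000000)` a few times and comparing `seconds` across the two builds would give a first impression. Spin loops and thread placement add a lot of noise, so treat any single number with suspicion.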

saagarjha 2021 days ago

TSO is possible to enable outside of Rosetta with some shenanigans in the kernel. Unfortunately, getting Rosetta to generate code that is comparable to what a compiler would create is quite difficult: it needs to lift x86 into its own IR and then redo register allocation, which it is quite good at, but obviously not perfect.
