by snvzz 1103 days ago
Strong memory ordering is convenient for the programmer.

But it is a no-go for SMP scalability.

That's why most architectures today use weak ordering.

x86 is alone and a dinosaur.

3 comments

I’m curious to hear which problems in particular you think make it a no-go. A strong model has challenges, but I am not aware of any total showstoppers.

x86 has also illustrated the triangle, garnering some weakly ordered benefits with examples like AVX-512 and Enhanced REP MOVSB.

The interesting thing is that both solutions (weak ordering, special instructions) have largely been left to the compiler to manage, so it could become a question of which one the compiler is better able to leverage. For example, if people are comfortable programming MP code in C on a strong memory model but reach for Python on a weak memory model, things could shake out differently than expected.

> Strong memory ordering […]

> But it is a no-go for SMP scalability.

The SPARC architecture introduced TSO (total store order) and defaults to it, and SPARC (and later UltraSPARC) systems were among the first successful, highly scaleable SMP implementations.
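
For anyone who has not met TSO before, here is a toy sketch of what it permits and forbids. It is not the formal SPARC definition: it assumes each CPU has a FIFO store buffer that drains to memory at arbitrary times, that loads check the CPU's own buffer first, and that everything else runs in program order. The `outcomes` function, the tuple-based instruction encoding and the two litmus programs are all made up for this illustration.

```python
def outcomes(program, tso):
    """Enumerate every final register state of a tiny multi-threaded program.

    program: list of threads; each instruction is ("st", var, value) or
             ("ld", var, register). Memory starts out as all zeros.
    tso:     True  -> stores go through a per-thread FIFO store buffer,
             False -> stores hit memory immediately (sequential consistency).
    """
    results = set()

    def step(pcs, buffers, memory, regs):
        progressed = False
        for t, thread in enumerate(program):
            # Choice 1: thread t executes its next instruction.
            if pcs[t] < len(thread):
                progressed = True
                op = thread[pcs[t]]
                new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                if op[0] == "st":
                    _, var, val = op
                    if tso:
                        grown = buffers[t] + ((var, val),)
                        new_bufs = buffers[:t] + (grown,) + buffers[t + 1:]
                        step(new_pcs, new_bufs, memory, regs)
                    else:
                        step(new_pcs, buffers, {**memory, var: val}, regs)
                else:
                    _, var, reg = op
                    val = memory.get(var, 0)
                    # Store forwarding: the youngest buffered store wins.
                    for bvar, bval in buffers[t]:
                        if bvar == var:
                            val = bval
                    step(new_pcs, buffers, memory, {**regs, reg: val})
            # Choice 2: thread t's oldest buffered store drains to memory.
            if buffers[t]:
                progressed = True
                var, val = buffers[t][0]
                new_bufs = buffers[:t] + (buffers[t][1:],) + buffers[t + 1:]
                step(pcs, new_bufs, {**memory, var: val}, regs)
        if not progressed:  # all threads finished, all buffers empty
            results.add(tuple(regs[r] for r in sorted(regs)))

    n = len(program)
    step((0,) * n, ((),) * n, {}, {})
    return results


# Store buffering: each thread writes one flag, then reads the other.
SB = [[("st", "x", 1), ("ld", "y", "r0")],
      [("st", "y", 1), ("ld", "x", "r1")]]

# Message passing: the writer publishes data, then sets a flag.
MP = [[("st", "data", 1), ("st", "flag", 1)],
      [("ld", "flag", "r0"), ("ld", "data", "r1")]]
```

In this model `outcomes(SB, tso=True)` contains `(0, 0)`, which sequential consistency rules out, while `outcomes(MP, tso=True)` never contains `(1, 0)`: a reader that sees the flag also sees the data without any fence. That second property is the "convenient for the programmer" part.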

Sun Enterprise 10000 (E10k) servers, initially released in 1997, could be configured with up to 64x CPUs; the Sun Fire 15K (available in 2002) could support up to 106x CPUs; and the Sun Fire E25K, released in 2004, could support up to 72x dual-core CPUs (144x CPU cores in total).

SPARC survives (albeit not frequently heard about today) as Oracle SPARC T8-4 and M8-8 (8x CPUs in the M8-8, 32x CPU cores each, 256 threads per CPU) and Fujitsu[0] SPARC M12-2S (32x CPUs, 384 cores and 3072 CPU threads in total).

All of the above are SMP systems with a great many CPUs, CPU cores and CPU threads.

A successful, scaleable SMP architecture has to get the cache coherence protocols right irrespective of whether the ISA implements TSO, is weakly ordered, or takes a hybrid approach.

To ensure cache coherence in a TSO UltraSPARC SMP architecture, the Sun E10k took a threefold approach: 1) it broadcast coherence requests on a logical bus (as opposed to a physical bus) shaped as a tree in which all CPUs were leaves and all links between them were point-to-point, 2) greater coherence request bandwidth could be achieved by using multiple logical buses whilst still maintaining a total order of coherence requests (the E10k had four logical buses, and coherence requests were address-interleaved across them), and 3) data response messages, which are much larger than request messages, did not require the totally ordered broadcast network required for coherence requests.
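
A rough sketch of the address interleaving in point 2, to show why splitting requests across buses does not have to cost per-line ordering. Only the bus count of four comes from the description above. The 64-byte interleave granularity, the use of the low-order bits of the line index, and the `LogicalBus` class are assumptions made for illustration, not details taken from E10k documentation.

```python
CACHE_LINE_BYTES = 64  # assumption: interleave granularity chosen for the example
NUM_BUSES = 4          # the E10k had four logical buses


def bus_for_address(addr: int) -> int:
    """Pick the logical bus for a coherence request from its cache-line index."""
    return (addr // CACHE_LINE_BYTES) % NUM_BUSES


class LogicalBus:
    """Hypothetical stand-in for one totally ordered broadcast bus."""

    def __init__(self):
        # Requests in the single order in which every CPU observes them.
        self.log = []

    def broadcast(self, cpu: int, addr: int) -> None:
        self.log.append((cpu, addr))


buses = [LogicalBus() for _ in range(NUM_BUSES)]


def send_coherence_request(cpu: int, addr: int) -> int:
    """Route a request to its bus and return the bus index that was used."""
    bus = bus_for_address(addr)
    buses[bus].broadcast(cpu, addr)
    return bus
```

Every request for a given line lands on the same bus, so requests to that line stay totally ordered, while requests to neighbouring lines spread across all four buses and can proceed side by side.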

The E10k scaled exceptionally well in its SMP setup whilst using TSO. It was also highly performant in its prime, and the successor Sun Fire family improved even further.

Therefore, the claim that strong memory ordering is a no-go for SMP scalability is bunk.

[0] And Fujitsu has been a well-known poster child for building massively scaleable, (Ultra-)SPARC-based supercomputing systems for a very long time as well.

Memory ordering tends not to play much into the design issues of xMP systems. As long as you have a coherent and properly scalable cache and NoC, the actual memory ordering of the local processor is irrelevant to the total performance of the system, since the LSU and L1 cache are (typically) responsible for providing ordering. The reason most architectures use weaker memory ordering rules is that they make it much easier to extract memory parallelism, and therefore easier to build faster individual cores.

Yep. There are plenty of Intel processors that have plenty of cores (and multiple sockets, even).