
Even on a single core, there are several copies in different cache layers, and synchronizing them is done by sending asynchronous messages. Sure, in the one particular edge case where the threads share a core you're right, but that is not a typical scenario for multi-threaded applications. Most of the time, for high multi-threaded performance you want exactly the opposite: one thread per core, with threads pinned to cores. And if you don't pin anything, you can never be sure whether your threads run on the same core or not, so you should assume the worst.

> And it's mostly software design that initiates the slowdowns, not the CPU.

This is quite a vague statement and I'm not sure what you really meant here. Software written against a simplified abstraction model (e.g. flat memory with data shared between threads, ordered sequential execution) that is very different from how a CPU really works (hierarchical memory, out-of-order execution, implicit parallelism, etc.) is very likely to cause "magic" slowdowns. See e.g. false sharing.
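To make the false sharing point concrete, here is a rough C++ sketch. It is only an illustration under assumptions: the 64-byte cache line size, the names `PackedCounter`, `PaddedCounter`, `count_in_parallel` and `time_ms` are all mine, and it needs C++17 so that the over-aligned struct is allocated correctly. Each thread touches only its own counter, so in the flat-memory model nothing is shared, yet the packed layout puts several counters on one cache line.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines (typical on x86-64, not guaranteed elsewhere).
constexpr std::size_t kCacheLine = 64;

// Counters stored back to back, so several of them land on one cache line.
struct PackedCounter {
    std::atomic<std::uint64_t> value{0};
};

// Same counter, but aligned and padded so each one owns a full cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

// Each thread increments only its own counter. Returns the sum of all counters.
template <typename Counter>
std::uint64_t count_in_parallel(unsigned threads, std::uint64_t iterations) {
    assert(threads > 0);
    std::vector<Counter> counters(threads);  // C++17 needed for over-aligned allocation
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&counters, t, iterations] {
            for (std::uint64_t i = 0; i < iterations; ++i) {
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();

    std::uint64_t total = 0;
    for (auto& c : counters) total += c.value.load(std::memory_order_relaxed);
    return total;
}

// Wall-clock time of one run, in milliseconds.
template <typename Counter>
double time_ms(unsigned threads, std::uint64_t iterations) {
    auto start = std::chrono::steady_clock::now();
    count_in_parallel<Counter>(threads, iterations);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Comparing `time_ms<PackedCounter>(4, 50000000)` with `time_ms<PaddedCounter>(4, 50000000)` on a multi-core machine should show the effect, though how big the gap is depends on the hardware and on where the scheduler places the threads. Both variants compute the same result; only the memory layout differs.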

Also, algorithms designed around the concept of shared mutability do not scale. Sure, you may hide some of the problems with reordering, out-of-order execution, etc. That will help to some degree, but not once you scale to several thousand cores in a geographically distributed system.
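A small C++ sketch of what I mean by designing around shared mutability versus avoiding it. The function names `sum_shared` and `sum_partitioned` are made up for this example, and summing a vector is just a stand-in for real work. The first version has every thread hammer one shared atomic; the second gives each thread its own slice and a local accumulator, and only combines results after the threads are done.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Shared-mutability version: every element is added to one shared atomic,
// so all threads contend for the same cache line. Requires threads > 0.
std::uint64_t sum_shared(const std::vector<std::uint32_t>& data, unsigned threads) {
    std::atomic<std::uint64_t> total{0};
    const std::size_t chunk = (data.size() + threads - 1) / threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &total, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                total.fetch_add(data[i], std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}

// Partitioned version: each thread sums its slice into a local variable and
// writes a single result at the end. Nothing mutable is shared while working.
std::uint64_t sum_partitioned(const std::vector<std::uint32_t>& data, unsigned threads) {
    std::vector<std::uint64_t> partial(threads, 0);
    const std::size_t chunk = (data.size() + threads - 1) / threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            std::uint64_t local = 0;
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;  // one write per thread, after the loop
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}
```

Both functions return the same sum. The second shape is also the one that carries over to a distributed setting: each node works on its own partition and only the small partial results travel over the network.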