
by aseipp 1412 days ago

The example code uses an atomic store instruction to write values from threads to a memory location, and then an atomic load to read them back. The system guarantees that a read of a location is consistent with the write that preceded it, i.e. "you always read the thing you just wrote" (x86 provides this as part of its memory model, which is called "Total Store Order"). Reads and writes to a memory location are translated into messages on a memory bus, which is connected to a memory controller that the CPUs use to talk to the memory they have available. The memory controller is responsible for ensuring every CPU sees a consistent view of memory according to the platform's memory ordering rules, and with respect to the incoming read/write requests from the various CPUs. (There are also caches between the DRAM and the CPU, but they are just another layer in the hierarchy and aren't material to the high-level view, because you can keep adding layers; some systems even have L1, L2, L3, and L4 caches!)

A CPU will normally translate atomic instructions like "store this 32-bit value to this address" into special messages on the memory bus. Atomic operations, it turns out, are already normally implemented in the message protocol between the cores and the memory fabric, so the CPU just translates atomic instructions into atomic messages "for free" and lets the controller sort it out. But the rules for how those messages flow across the memory bus are complicated, because the topology of modern CPUs is complicated. Cores are partitioned into NUMA domains, have various caches that are shared or not shared between clusters of 1, 2, or 4 cores, et cetera. The caches and the interconnects between them must all still obey the memory consistency rules defined by the platform. As a result, there isn't necessarily a uniform amount of time for a write to location X from one core to become visible to another core that reads X; you have to measure it to see how the system responds, and the response might include expensive operations like flushing the cache. It turns out that two cores that are very far apart will simply take more time to see a message, since the bus path between them will likely be longer: the latency of a core-to-core memory write, up to the point where the write is consistently visible, will be higher.
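
To make the "you have to measure it" part concrete, here is a rough sketch of the kind of ping-pong measurement involved. This is not the code from the article; `ping_pong_round_trip` is a name I made up, and it only uses the Rust standard library. One big caveat: it does not pin the two threads to specific cores, so it only measures whichever pair of cores the OS scheduler happens to pick. A real tool would pin each thread (which needs platform-specific calls or an extra crate) and repeat this for every pair of cores to build the heatmap.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper (not from the article): bounce a value between two
/// threads through one shared atomic and return the average round-trip time.
/// Assumption: threads are NOT pinned, so the OS decides which cores are used.
fn ping_pong_round_trip(iterations: u32) -> Duration {
    assert!(iterations > 0, "need at least one iteration");

    let cell = Arc::new(AtomicU64::new(0));
    let responder_cell = Arc::clone(&cell);

    // Responder: wait for the "ping" value 2*i + 1, answer with 2*i + 2.
    let responder = thread::spawn(move || {
        for i in 0..iterations {
            let ping = 2 * u64::from(i) + 1;
            while responder_cell.load(Ordering::Acquire) != ping {
                std::hint::spin_loop();
            }
            responder_cell.store(ping + 1, Ordering::Release);
        }
    });

    // Initiator: store the ping, then spin until the answer becomes visible.
    let start = Instant::now();
    for i in 0..iterations {
        let ping = 2 * u64::from(i) + 1;
        cell.store(ping, Ordering::Release);
        while cell.load(Ordering::Acquire) != ping + 1 {
            std::hint::spin_loop();
        }
    }
    let elapsed = start.elapsed();

    responder.join().expect("responder thread panicked");

    // Each iteration is one full round trip (two cross-thread handoffs).
    elapsed / iterations
}
```

Calling something like `ping_pong_round_trip(100_000)` from `main` and printing the result gives one number for one (unknown) pair of cores. Half of it is a crude estimate of the one-way latency. The first iterations also include thread startup, so a more careful version would warm up before starting the timer.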

So when you're designing high-performance algorithms and systems, you want to keep the CPU topology and memory hierarchy in mind. That's the most important takeaway. From that standpoint, these heatmaps are simply a useful way of characterizing the baseline performance of some basic operations between CPUs, so you can get an idea of how topology affects memory latency.