In HFT, we typically pin processes to run on a single isolated core (on a multicore machine). That allows the process to avoid a lot of kernel and other interrupts which could cause the process to not operate in a low latency manner.
If we have two of these processes, each on separate cores, and they occasionally need to talk to each other, then knowing the best choice of process/core location can keep the system operating in the lowest latency setup.
So, an app like this could be very helpful for determining where to place pinned processes onto specific cores.
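For anyone who hasn't done the pinning part before, here is a minimal sketch, assuming Linux and Python's os.sched_setaffinity wrapper. The pin_to_core helper and the choice of target core are hypothetical, just for illustration. Isolating the core from the scheduler (for example with the isolcpus kernel boot parameter) is a separate step this does not cover.

```python
import os


def pin_to_core(core):
    """Pin the calling process to one core and return the resulting affinity set."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("sched_setaffinity is not available on this platform")
    os.sched_setaffinity(0, {core})  # pid 0 means the current process
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    allowed = sorted(os.sched_getaffinity(0))
    target = allowed[-1]  # hypothetical choice: the last core we are allowed on
    print("pinned to", pin_to_core(target))
```

Which core to pass in is exactly the question a core-to-core latency map helps answer.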
There are also some common rules of thumb, such as: don't put pinned processes that need to communicate on cores that are separated by the QPI, since that just adds latency. If you're communicating with a NIC, find out which socket has the shortest path on the PCI bus to that NIC, and other fun stuff. I never even thought about NUMA until I started to work with folks in HFT. It really makes you dig into the internals of the hardware to squeeze the most out of it.
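A rough sketch of the NIC lookup, assuming Linux sysfs: the kernel exposes a device's NUMA node under /sys/class/net/<iface>/device/numa_node and each node's CPUs under /sys/devices/system/node/nodeN/cpulist. The helper names and the "eth0" interface name are hypothetical.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8-11" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def cpus_near_nic(iface, sysfs=Path("/sys")):
    """Return (numa_node, cpus) for a network interface, or (None, []) if unknown."""
    node_file = sysfs / "class" / "net" / iface / "device" / "numa_node"
    if not node_file.exists():
        return None, []  # virtual interface, unknown name, or not Linux
    node = int(node_file.read_text())
    if node < 0:
        return None, []  # the kernel reports -1 when there is no NUMA affinity
    cpulist = sysfs / "devices" / "system" / "node" / f"node{node}" / "cpulist"
    return node, parse_cpulist(cpulist.read_text())


if __name__ == "__main__":
    print(cpus_near_nic("eth0"))  # "eth0" is a placeholder interface name
```

This only tells you which socket the NIC hangs off. It says nothing about latency between the cores on that socket.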
In general this makes sense, but I think you need to be careful in some cases where the lowest latency between two logical "cores" is likely to be between those which are SMT siblings on the same physical core (assuming you have an SMT-enabled system). These logical "cores" will be sharing much of the same physical core's resources (such as the low-latency L1/L2 and micro-op caches), so depending on the particular workload, pinning two threads to these two logical "cores" could very well result in worse performance overall.
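To see which logical "cores" are SMT siblings before pinning, something like the following sketch can help. It assumes Linux sysfs and assumes that the pair (physical_package_id, core_id) identifies a physical core, which is worth checking on your own hardware. All function names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def read_core_ids(sysfs_cpu=Path("/sys/devices/system/cpu")):
    """Map logical CPU number -> (package_id, core_id) from Linux sysfs topology files."""
    ids = {}
    for topo in sysfs_cpu.glob("cpu[0-9]*/topology"):
        cpu = int(topo.parent.name[3:])  # "cpu12" -> 12
        try:
            package = int((topo / "physical_package_id").read_text())
            core = int((topo / "core_id").read_text())
        except (OSError, ValueError):
            continue  # offline CPU or unreadable entry
        ids[cpu] = (package, core)
    return ids


def group_smt_siblings(core_ids):
    """Group logical CPUs that share a physical core."""
    groups = defaultdict(list)
    for cpu, key in sorted(core_ids.items()):
        groups[key].append(cpu)
    return sorted(groups.values())


def one_cpu_per_core(core_ids):
    """Pick the lowest-numbered logical CPU from each physical core."""
    return [siblings[0] for siblings in group_smt_siblings(core_ids)]


if __name__ == "__main__":
    ids = read_core_ids()
    print("sibling groups:", group_smt_siblings(ids))
    print("one logical CPU per physical core:", one_cpu_per_core(ids))
```

Pinning latency-sensitive threads only to the one_cpu_per_core list keeps them from sharing a physical core's caches with each other, though it does not stop the OS from scheduling other work on the sibling.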
Doesn't this leave some performance on the table? Each core has more ports than a single thread could reasonably use, exactly because two threads can run on a single core
In terms of throughput, technically yes, you are leaving performance on the table. However, in HFT the throughput is greatly limited by IO anyways, so you don't get much benefit with it enabled.
What you want is to minimize latency, which means you don't want to be waiting for anything before you start processing whatever information you need. To do this, you need to ensure that the correct things are cached where they need to be, and SMT means that you have multiple threads fighting each other for that precious cache space.
In non-FPGA systems I've worked with, I've seen dozens of microseconds of latency added with SMT enabled vs disabled.
Maybe 10 years ago that was the common thing, but there are so many extra resources (esp. registers) that it is now giving up almost half the chip. If you can be cache friendly enough, the extra cycles will make up for it.
No, this is not true at all. "The extra cycles" is the exact thing you want to avoid in HFT. It doesn't matter how much throughput of processing you can put through a single core if you enable SMT, because somewhere in the path (either broker, exchange, or some switch in between) you will eventually be limited in throughput to the point that it becomes irrelevant.
The only thing that matters at that point is latency, and unless you are cache-friendly enough to store your entire program in a single core's cache twice over, you would be better off disabling SMT altogether. And even if you were able to do that, it would not matter as a single thread would be done processing a message by the time the next one comes in. At least at the currently standard 10-25Gbps that the exchanges can handle.
In HFT, we're fine giving up half the registers in a core if it means we get an extra few microseconds of latency back.
What kind of "talking" are we talking about? I thought most IPC works somehow via shared memory under the hood rather than CPU cores actually communicating, how would you even do that?
"Shared memory" is really more of a description of the memory model that is exposed to the programmer, rather than the hardware.
Under the hood, there are caches -- sometimes memory addresses live in a cache above you because you put them there, sometimes they live in a cache above you because a neighboring core that shares your cache put them there, sometimes they live in RAM, sometimes they live in another cache on your chip and you have to ask for them through the on-chip network. The advice I have been given (as a non-HFT guy) is just to try not to mess around too much with the temporal locality, pin threads to cores, and let the hardware handle the rest.
It's mentioned in the readme - this is measuring the latency of cache coherence. Depending on architecture, some sets of cores will be organized with shared L2/L3 cache. In order to acquire exclusive access to a cache line (memory range of 64-128ish bytes), a core has to wait for caches belonging to other sets of cores to release their own exclusive access, or to be informed that they need to invalidate their copies. This is observable as a small number of cycles of additional memory access latency that is heavily dependent on hardware cache design, which is what is being measured.
Cross-cache communication may simply happen by reading or writing to memory touched by another thread that most recently ran on another core
Check out https://en.wikipedia.org/wiki/MOESI_protocol for starters, although I think modern CPUs implement protocols more advanced than this (I think MOESI is decades old at this point)
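To make the invalidation traffic concrete, here is a toy model of one cache line under simplified MESI (not MOESI, and certainly not what any real CPU implements). It only counts how often a core has to involve another core's cache. The class, method, and counter names are all made up for this sketch.

```python
class CacheLine:
    """One cache line tracked across n cores using simplified MESI states."""

    def __init__(self, n_cores):
        self.state = ["I"] * n_cores
        self.remote_transactions = 0  # accesses that had to involve another cache

    def _other_holders(self, core):
        return [c for c, s in enumerate(self.state) if c != core and s != "I"]

    def read(self, core):
        if self.state[core] != "I":
            return  # M, E and S can all be read locally
        holders = self._other_holders(core)
        for c in holders:
            self.state[c] = "S"  # holders downgrade to Shared
        if holders:
            self.remote_transactions += 1
        self.state[core] = "S" if holders else "E"

    def write(self, core):
        if self.state[core] in ("M", "E"):
            self.state[core] = "M"  # already exclusive: no other cache involved
            return
        holders = self._other_holders(core)
        for c in holders:
            self.state[c] = "I"  # other copies are invalidated
        if holders:
            self.remote_transactions += 1
        self.state[core] = "M"


def same_core(n_writes):
    line = CacheLine(2)
    for _ in range(n_writes):
        line.write(0)
    return line.remote_transactions


def ping_pong(n_writes):
    line = CacheLine(2)
    for i in range(n_writes):
        line.write(i % 2)  # the two cores alternate writes to the same line
    return line.remote_transactions


if __name__ == "__main__":
    print(same_core(100))  # 0: the line stays Modified in one cache
    print(ping_pong(100))  # 99: every write after the first invalidates the other copy
```

Each of those remote transactions is where the extra cycles come from on real hardware, and how many cycles depends on where the two caches sit relative to each other.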
AMD processors also use a hierarchical coherence directory, where the global coherence directory on the IO die enforces coherence across chiplets and a local coherence directory on each chiplet enforces coherence on-die http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/www/le...
The example code uses an atomic store instruction to write values from threads to a memory location, and then an atomic load to read them. The system guarantees that a read of a location returns the value most recently written to it, i.e. "you always read the thing you just wrote" (x86's memory model, which includes this guarantee, is called "Total Store Ordering"). Reads and writes to a memory location are translated to messages on a memory bus, and that is connected to a memory controller, which the CPUs use to talk to the memory they have available. The memory controller is responsible for ensuring every CPU sees a consistent view of memory according to the respective platform memory ordering rules, and with respect to the incoming read/write requests from various CPUs. (There are also caches between the DRAM and CPU here but they are just another layer in the hierarchy and aren't so material to the high-level view, because you can keep adding layers, and indeed some systems even have L1, L2, L3, and L4 caches!)
A CPU will normally translate atomic instructions like "store this 32-bit value to this address" into special messages on the memory bus. Atomic operations, it turns out, are already normally implemented in the message protocol between cores and memory fabric, so you just translate the atomic instructions into atomic messages "for free" and let the controller sort it out. But the rules of how instructions flow across the memory bus are complicated because the topology of modern CPUs is complicated. They are divided, partitioned into NUMA domains, have various caches that are shared or not shared between 1-2-or-4-way clusters, et cetera. They must still obey the memory consistency rules defined by the platform, and so must all the caches and interconnects between them. As a result, there isn't necessarily a uniform measurement of time for any particular write to location X from a core to be visible to another core when it reads X; you have to measure it to see how the system responds, which might include expensive operations like flushing the cache. It turns out two cores that are very far apart will just take more time to see a message, since the bus path will likely be longer -- the latency will be higher for a core-to-core memory write where the write will be visible consistently.
So when you're designing high performance algorithms and systems, you want to keep the CPU topology and memory hierarchy in mind. That's the most important takeaway. From that standpoint, these heatmaps are simply useful ways of characterizing the baseline performance of some basic operations between CPUs, so you might get an idea of how topology affects memory latency.
Hm. Use icelake, with an aggregator process sitting in core 11 and have all the others run completely on input alone and then report to core 11. (Core 11 from that heatmap appears to be the only cpu with a sweetheart core having low latency to all other cores.) I wonder how hard it is to write a re-writer to map an executable to match cpu architecture characteristics. Something like graph transformations to create clusters (of memory addresses) that are then mapped to a core.
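Picking the "sweetheart" core doesn't have to be done by eye. Here is a small sketch, assuming you have the heatmap as a square matrix of core-to-core latencies in nanoseconds. The matrix below is invented for illustration and is not taken from any real heatmap; best_aggregator is a hypothetical helper.

```python
def best_aggregator(latency_ns):
    """Return (core, worst_latency) for the core whose worst-case latency to any
    other core is smallest. Expects a square matrix with at least two cores."""
    best = None
    n = len(latency_ns)
    for core in range(n):
        worst = max(
            max(latency_ns[core][other], latency_ns[other][core])
            for other in range(n)
            if other != core
        )
        if best is None or worst < best[1]:
            best = (core, worst)
    return best


if __name__ == "__main__":
    # Hypothetical 4-core latency matrix in ns (row = from, column = to).
    latency_ns = [
        [0, 40, 90, 95],
        [40, 0, 45, 50],
        [90, 45, 0, 42],
        [95, 50, 42, 0],
    ]
    print(best_aggregator(latency_ns))  # (1, 50)
```

Minimizing the worst case is just one possible criterion. If the workers report at very different rates, a weighted average might suit an aggregator better.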