|
Anyway, since you expressed disappointment at the ~1000 cycle cost: that's about right. The latency between cores is actually quite high and there's not much you can do about it, especially on a system like x86, which has extremely strong cache coherency by default.
One thing that is really important to understand is that the ownership of cache lines dramatically affects the cost of memory access. IPC effectively requires one thread to write to memory (which puts that cache line in the modified state on that core and invalidates it on all other cores). Then, when the polling thread checks in on the line, the line has to be demoted from modified to shared by flushing it out (usually to L2 or L3, but it can also mean writing out to memory). This causes memory traffic and means the cores are constantly fighting over the same cache line. Since x86 is strongly ordered and its caches are coherent, this traffic is extremely expensive.
Think of a write as "tell all other cores that you modified this memory, so they have to evict/invalidate their cache lines", and a read as "tell all cores to flush their cache lines if they're modified, wait for them to tell me they're done, and only then read the memory". This is effectively a massively blocking operation. The simple act of reading the mailbox/ticket/whatever from another core to check if a message is ready will dramatically affect the speed at which the other core can write to it (as that write now pays the full latency).
There are some tricks you can do to get extremely low latency between cores. One of them is making sure you're on cores that are physically near each other (e.g. on the same processor socket). This is only really relevant on servers, but it's a big thing. You can actually map out the physical processor layout, including within a single die, based on the latency between cores.
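To make the core-to-core round trip concrete, here's a rough sketch of the kind of ping-pong measurement you could build on. Rust is my choice here, not something from the question, and `Line` and `ping_pong` are made-up names. It assumes 64-byte cache lines and does _not_ pin the threads, so the OS decides which cores you end up measuring; for actual layout mapping you'd pin each thread with your platform's affinity API, which I've left out.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// One flag per cache line (64 bytes is an assumption, typical for x86),
/// so the two flags never share a line.
#[repr(align(64))]
struct Line(AtomicU32);

/// Bounces a counter between two threads `rounds` times and returns the
/// average round-trip time. Threads are not pinned (assumption: the OS
/// puts them on different cores).
fn ping_pong(rounds: u32) -> Duration {
    assert!(rounds > 0, "need at least one round");
    let ping = Arc::new(Line(AtomicU32::new(0)));
    let pong = Arc::new(Line(AtomicU32::new(0)));

    let (ping_rx, pong_tx) = (Arc::clone(&ping), Arc::clone(&pong));
    let responder = thread::spawn(move || {
        for i in 1..=rounds {
            // Poll the "ping" line until the writer publishes round i.
            while ping_rx.0.load(Ordering::Acquire) != i {
                std::hint::spin_loop();
            }
            // Answer on a separate line.
            pong_tx.0.store(i, Ordering::Release);
        }
    });

    let start = Instant::now();
    for i in 1..=rounds {
        ping.0.store(i, Ordering::Release);
        // Each iteration pulls the "pong" line back from the other core.
        while pong.0.load(Ordering::Acquire) != i {
            std::hint::spin_loop();
        }
    }
    let elapsed = start.elapsed();
    responder.join().unwrap();
    elapsed / rounds
}
```

Calling something like `ping_pong(1_000_000)` a few times and keeping the lowest result gives a rough round-trip number; halve it for an approximate one-way latency. Without pinning and a quiet machine the numbers will be noisy.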
That mapping is quite subtle and requires low noise, but it's really cool to map out the grid of cores on the actual silicon purely from timing.
Another trick is to have both threads on the same core, using hyperthreads. Hyperthreads share the same core and thus a lot of resources, and they can skip some of the more expensive coherency traffic because they share the same L1 cache (since L1 is per-core). The lowest IPC latency you will be able to observe is between hyperthreads on the same core, but that's often not really useful for _processing_ the data, since performance will not be great with two busy hyperthreads on one core. In theory, though, you can signal a hyperthread, and that hyperthread can go and raise some other signal while the original hyperthread continues doing relevant work. As long as one of them is blocking/halted, the other won't really be affected by sharing the core.
Finally, the most reasonable trick is making sure your tickets/buffers/mailboxes/whatever are _not_ sharing the same cache lines (unless they contain data which is passed at the same time). Once again, the CPU keeps things in sync at cache line granularity, so having two pieces of data on the same cache line being hammered by two cores is asking for hundreds of cycles per trivial data access.
This can be observed in an extreme case on many-core systems with multiple sockets fighting over locks. I've done this on my 96C/192T system and I've been able to get single `lock inc [mem]` instructions to take over 15,000 cycles to complete, which is unreal for a single instruction. But that's what happens when there's 200-500 cycles of overhead every single time that cache line is "stolen" back from other cores.
So, effectively, keep in your head which state cache lines will be in. If they're going to be modified on one core, make sure they're not going to be read on another core while still being written.
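As a minimal sketch of the "don't share cache lines" trick, here's one way to give each mailbox its own line. This is Rust (my choice, not something from the question), `CacheLine`, `Mailboxes`, and `hammer` are hypothetical names, and the 64-byte line size is an assumption you should check for your target.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Wrapper that rounds its contents up to a whole cache line.
/// 64 bytes is an assumption (typical for x86).
#[repr(align(64))]
struct CacheLine<T>(T);

/// Hypothetical pair of mailboxes, each hammered by a different thread.
struct Mailboxes {
    a: CacheLine<AtomicU64>,
    b: CacheLine<AtomicU64>,
}

impl Mailboxes {
    fn new() -> Self {
        Mailboxes {
            a: CacheLine(AtomicU64::new(0)),
            b: CacheLine(AtomicU64::new(0)),
        }
    }
}

/// Two threads, each doing atomic increments on its own mailbox.
/// Because `a` and `b` sit on separate lines, neither thread has to
/// steal the other's line back on every increment.
fn hammer(boxes: &Mailboxes, iters: u64) {
    thread::scope(|s| {
        s.spawn(|| {
            for _ in 0..iters {
                boxes.a.0.fetch_add(1, Ordering::Relaxed);
            }
        });
        s.spawn(|| {
            for _ in 0..iters {
                boxes.b.0.fetch_add(1, Ordering::Relaxed);
            }
        });
    });
}
```

If you drop the `#[repr(align(64))]` and time `hammer` both ways, the unpadded version puts both counters on one line and should be noticeably slower on a multi-core machine, though how much depends on the hardware.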
Cache line state transitions are expensive. You're only going to get your 3-4 cycle "random L1 data hit performance" if a line being read is in the exclusive, modified, or shared state, and a line being written is in the exclusive or modified state. Anything else and you're probably paying hundreds of cycles for the access, and probably hurting the other side too.
Ultimately, what you're asking from the CPU is actually extremely complex. Think about how hard it would be for you to keep a copy of a database in sync between hundreds of writers and readers (cores). The CPU is doing this automatically for you under the hood, on every single memory access. It is _not_ free. So you really have to engineer around this problem: batch your operations, find a design that doesn't require such intense IPC, etc. On more weakly ordered systems you can use some more tricks in page tables to get more explicit control over how cache coherency is applied to various chunks of memory. |