by CountSessine 3513 days ago
This is a fantastic resource; kudos to the author. But there is one thing in this reference that I found unexpected:

One further thing which is related to memory accesses and performance, is rarely observed on desktops (as it requires multi-socket machines – not to be confused with multi-core ones ... When multiple sockets are involved, modern CPUs tend to implement so-called NUMA architecture, with each processor (where “processor” = “that thing inserted into a socket”) having its own RAM

I thought that all Intel chips since Nehalem divided their SDRAM access into a NUMA configuration based on cores? Am I wrong about that?

2 comments

NUMA typically affects multi-socket machines only. An exception would be high-end Xeon chips since Haswell when used in a cluster-on-die configuration, but you won't find that in a desktop PC. Each socket in a multi-socket system has its own memory, and when a CPU accesses memory attached to another socket, it pays a fairly hefty latency penalty compared to accessing its own memory.
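To make the near/far distinction concrete, here is a minimal sketch that parses rows in the style Linux exposes under /sys/devices/system/node/nodeN/distance and turns them into a cost relative to local access. The two-socket values below are made up for illustration, and the helper names (`parse_numa_distances`, `remote_penalty`) are hypothetical. These distances are firmware-reported relative weights, not measured latencies, so treat the ratio as a rough hint only.

```python
def parse_numa_distances(rows):
    """Parse rows in the style of /sys/devices/system/node/nodeN/distance.

    Each row is a whitespace-separated list of relative distances from one
    node to every node; by convention the local distance is 10.
    """
    return [[int(x) for x in row.split()] for row in rows]


def remote_penalty(matrix, src, dst):
    """Relative cost of node `src` touching memory on node `dst` (1.0 = local)."""
    return matrix[src][dst] / matrix[src][src]


# Hypothetical two-socket machine; these numbers are assumptions, not measurements.
example = parse_numa_distances(["10 21", "21 10"])
print(remote_penalty(example, 0, 0))  # local access
print(remote_penalty(example, 0, 1))  # access to the other socket's memory
```

On a single-socket desktop the matrix would normally be one row with one entry, which is another way of saying there is no "far" memory to worry about.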
Just thinking about it, I guess this makes perfect sense - all memory access on a socket will converge on a common L3 cache, and it would be just bizarre if each core somehow did a 'private write-through' to its own SDRAM.
I don't think there's much NUMA action on a single socket at the moment, but this will change as CPU area increases and more of the transistors are not actually doing CPU work (to spread out the heat-making bits), which increases distances on a single die.
Unless there is core-specific RAM on the die, why? Isn't the essential aspect of NUMA the fact that there is some memory which is "near", and some which is "far"?
Yeah, as distance (latency) to RAM increases, the amount of on-die cache increases (another handy way to distribute heat with a performance bonus) and coherency becomes more costly, so in effect the cache becomes the core-specific RAM you mention.

(Oh, hi Kiko!)

Hey Jeff.. the nick fonts are really small on HN; I didn't see it was you!

I don't think latency to near RAM will increase; it would have too material a performance impact. Even in disaggregated designs like Rackscale there is definitely a concept of "near RAM", which is not cache, but which has very low latency.

However, your post made me realize that as the number of cores goes up, as with KNL, they are likely to be organized hierarchically with some clustered sharing of cache, so indeed NUMA-style affinity of workload to core starts paying off there. IOW, if you have thousands of cores on a chip, they definitely aren't going to be all sharing the same L2 and L3.
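A rough sketch of what "affinity of workload to core" could start from: group cores by which ones report sharing a cache. Linux exposes this per CPU in /sys/devices/system/cpu/cpuN/cache/indexM/shared_cpu_list as text like "0-3". The 8-core topology below is invented for illustration, and `parse_cpu_list` and `cache_clusters` are hypothetical helper names.

```python
def parse_cpu_list(text):
    """Expand a Linux-style CPU list such as "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def cache_clusters(shared_lists):
    """Group CPUs that report the same shared-cache CPU list.

    `shared_lists` maps a cpu id to the text of its shared_cpu_list entry.
    """
    clusters = {}
    for cpu, text in shared_lists.items():
        key = tuple(parse_cpu_list(text))
        clusters.setdefault(key, []).append(cpu)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical 8-core chip where cores share an L2 in groups of four.
topology = {cpu: ("0-3" if cpu < 4 else "4-7") for cpu in range(8)}
print(cache_clusters(topology))
```

With clusters in hand, threads that share data could be kept inside one cluster; on Linux, `os.sched_setaffinity` is one way to pin a process to a chosen set of CPUs.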

Are core-specific caches write-through or write-back?
I think that they're always write-back, but it gets complicated with the different levels, and it's different between Intel and AMD. My understanding is that Intel uses an 'inclusive' scheme, where everything held in a cache closer to the core is also held in the levels farther from it; i.e. if something is in L1 it will always also be in L2 and L3. With AMD's exclusive scheme, L1 or L2, when evicting a dirty line, can/will effectively 'write around' L2 or L3 straight to a lower level.
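For what the inclusive/exclusive distinction means in practice, here is a toy two-level model. It only tracks which line addresses are present, with LRU replacement; it ignores dirty bits and write-back entirely, and it is not a description of any particular Intel or AMD part. The class name, the sizes, and the access pattern are all assumptions chosen to keep the example small.

```python
from collections import OrderedDict


class TwoLevelCache:
    """Toy L1/L2 pair tracking only which line addresses are present (LRU)."""

    def __init__(self, l1_size, l2_size, inclusive):
        self.l1 = OrderedDict()
        self.l2 = OrderedDict()
        self.l1_size = l1_size
        self.l2_size = l2_size
        self.inclusive = inclusive

    @staticmethod
    def _insert(level, size, addr):
        """Insert addr as most recently used; return the evicted addr or None."""
        level[addr] = True
        level.move_to_end(addr)
        if len(level) > size:
            victim, _ = level.popitem(last=False)
            return victim
        return None

    def access(self, addr):
        if addr in self.l1:
            self.l1.move_to_end(addr)
            return
        if self.inclusive:
            # Fill both levels; a line evicted from L2 must also leave L1.
            l2_victim = self._insert(self.l2, self.l2_size, addr)
            if l2_victim is not None:
                self.l1.pop(l2_victim, None)
            # A line evicted from L1 simply stays behind in L2.
            self._insert(self.l1, self.l1_size, addr)
        else:
            # Exclusive: a line lives in exactly one level at a time.
            self.l2.pop(addr, None)
            l1_victim = self._insert(self.l1, self.l1_size, addr)
            if l1_victim is not None:
                # L2 acts as a victim cache for lines pushed out of L1.
                self._insert(self.l2, self.l2_size, l1_victim)


def run(inclusive):
    cache = TwoLevelCache(l1_size=2, l2_size=4, inclusive=inclusive)
    for addr in range(6):
        cache.access(addr)
    return cache


inc = run(True)
exc = run(False)
print("inclusive:", sorted(inc.l1), sorted(inc.l2))
print("exclusive:", sorted(exc.l1), sorted(exc.l2))
```

In this model the inclusive pair holds at most as many distinct lines as L2 alone, while the exclusive pair can hold L1 plus L2 worth of distinct lines, which is the usual argument for the exclusive approach when the levels are close in size.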