| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by Nursie 2508 days ago
	Most devs I know are creating async workloads which don't require cache coherency, as they use parallelism to parallel process separate requests and workloads. I can see things being pretty linear in that sort of space.

2 comments

cjhanks 2508 days ago

They are not linear unless all requests take an identical amount of time OR the system is not oversubscribed (common in many workloads) - and even then, the current linux CFS scheduler has a complexity of `O(log N)`.

When you have variable length requests, you will find cores will not always be balanced, it is simply a statistical reality. And in those cases, the kernel will have to migrate your process to a different core, and if you have 256 cores, that core might be really far away.

link

gmueckl 2508 days ago

Except that they are typically not. The Zen architectures are NUMA and controlling where memory is allocated is key to decent threaded performance. You may even have to do seemingly counterintuitive things like duplicating central data structures across nodes and other tricks from the distributed systems playbook.

link

coder543 2508 days ago

Epyc 2's memory layout is not like Epyc 1. Epyc 2 is very simple.

link

jdsully 2508 days ago

Yup everything is equally slow now. Kinda sad, but the original NUMA design was treated as a glass half empty situation instead of AMD letting people maximize performance. This change lets them avoid the bad press and everyone is happier despite the final design being slower than it could have been.

link

gmueckl 2508 days ago

Epyc 2 has different memory latencies within and across NUMA nodes according to the infirmation I have. So it is not equally slow for all memory. Can you point me to a source that says otherwise?

Edit: my source is this German article: https://www.heise.de/newsticker/meldung/AMD-Server-CPUs-Epyc...

link

jdsully 2508 days ago

See the architecture diagram here: https://www.anandtech.com/show/14694/amd-rome-epyc-2nd-gen/2

Everything goes through the central crossbar on the I/O die, where Zen1 had memory attached directly to each CPU chiplet which would relay as necessary. On Zen1 if you accessed direct attached memory you wouldn't pay the latency penalty from relaying the data. In Zen2 all data is relayed via the I/O die with the associated delay that entails.

link

gmueckl 2508 days ago

I did some more digging. It seems like the Linux NUMA topology shown in the anandtech article is a deliberate lie. There are different latencies between cores and memory comtrollers on the same socket, but these are deemed to be insignificant enough to not expose them in the reported NUMA topology.

link

wmf 2508 days ago

Epyc 1 was NUMA within the socket while Epyc 2 is officially UMA within the socket (although not really). Unfortunately Epyc memory latency is much higher than Intel so it's fair to call it uniformly slow.

link

Jweb_Guru 2506 days ago

Yeah, I actually was not so happy with the benchmarks because the memory access latency is not all that good... for most of the workloads that I care about, I don't know that the Epyc will be faster than a Xeon.

link