by mrkgnao 2867 days ago
> Because the two NUMA nodes are ~entirely independent, it's capable of running two independent processes at full speed.

I don't understand. From my (admittedly little better than layperson's) knowledge, I'm guessing the cores of most multicore processors have to compete for memory access...? Is there a good search term I can use to help me understand what's going on here?

3 comments

There are 2 dies in the 1950X, and each one has 2 memory channels. Thus, it's possible to run a process on one (8-core) die that maxes out the memory bandwidth to its two local DDR4 channels while the other die still has full-bandwidth access to its own DDR4 channels.

Threadripper is able to switch between NUMA (non-uniform memory access) mode and "regular" mode. In NUMA, the OS knows that 2 channels are attached to 1 die and 2 channels on the other, thus allowing lower latencies because the OS knows what RAM to allocate based on which core the process is running on.
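
On Linux you can see the layout the OS is working from by reading sysfs. A minimal sketch, assuming the usual `/sys/devices/system/node/nodeN/cpulist` files are present; the function names are made up for illustration, and the `"0-7,16-23"` string in the usage note is just an example layout, not a claim about how any particular 1950X numbers its cores:

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-7,16-23" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a node that has memory but no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_numa_topology(root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} from sysfs, or {} if the directory is absent."""
    topology = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        if match is None:
            continue
        try:
            with open(os.path.join(path, "cpulist")) as handle:
                topology[int(match.group(1))] = parse_cpulist(handle.read())
        except OSError:
            continue  # node directory without a readable cpulist
    return topology


if __name__ == "__main__":
    for node, cpus in sorted(read_numa_topology().items()):
        print(f"node {node}: cpus {cpus}")
```

If only one node shows up, the machine is either not NUMA or is running in the "regular" (uniform) mode. `parse_cpulist("0-7,16-23")` expands to CPUs 0 through 7 plus 16 through 23.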

As a bonus, if you run in explicit NUMA mode and the OS and your code do a good job, there's little cache-line contention or resource sharing (e.g., caches) between dies.
I found a significant performance benefit to keeping NUMA turned on when running Linux, for basically every workload.

For Windows, it is the other way around. I hope they'll improve their NUMA handling, but I'm not holding my breath.

The Linux kernel is clever about this. You can get some idea of what it does by looking at numactl, which lists the various scheduling modes -- though in practice the kernel does a great job without any user overrides, and actually using the command is likely to slow things down.

Which is not to say that it can't occasionally be helpful, if you're trying to optimize the speed of a single thread. At a minimum, you can choose between optimizing for bandwidth (interleaving data on all four memory channels) or latency (putting everything in the local node). Usually you want the latter.
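
For reference, the two choices map onto numactl invocations roughly like this. This is a sketch only: the helper and its `mode` names are invented for illustration and `./my_benchmark` is a placeholder, but `--interleave`, `--cpunodebind` and `--membind` are real numactl options.

```python
import shlex


def numactl_command(program, mode, node=0):
    """Build an argv list that runs `program` under numactl.

    mode="bandwidth": interleave pages across all nodes (all memory channels).
    mode="latency":   run on one node's CPUs and allocate only from its memory.
    """
    if mode == "bandwidth":
        prefix = ["numactl", "--interleave=all"]
    elif mode == "latency":
        prefix = ["numactl", f"--cpunodebind={node}", f"--membind={node}"]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return prefix + list(program)


if __name__ == "__main__":
    # "./my_benchmark" is a placeholder program name.
    for mode in ("bandwidth", "latency"):
        print(shlex.join(numactl_command(["./my_benchmark"], mode)))
```

Note that `--membind` is strict: allocations fail rather than spill to another node when the bound node runs out of memory. `--preferred` is the softer variant.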

Does it really work this way (with automatic memory and core pinning)? Both Windows and Linux can do that?
Linux does that. Windows...

Judging by the performance I'm (not) getting, Windows does a very poor job with NUMA.

Does Linux do that OOB or do I need to do some configuration?
I never configured anything.
NUMA means "(Explicitly) Non-Uniform Memory Access"; this means that some cores have easier (lower latency, higher bandwidth) access to some memory regions than others.

In practice this means that memory controllers are partitioned amongst groups of cores, with some slower and often otherwise busy interconnect between those groups.

The software implication is that if task X uses some bit of memory a lot, then that bit of memory better be node-local, i.e. easy to access for the core where task X is running.

Threadripper and Epyc are essentially multi-socket-in-a-package. There is an inter-processor link which is analogous to Intel QPI or UPI; it just runs between dies within a single socket instead of between dies in separate sockets.

Threadripper and Epyc present themselves as 2 or 4 separate NUMA nodes depending on model. Spreading a single task across multiple NUMA nodes usually hurts performance significantly (often slower than just running it on a single node using fewer threads), but you can run 2/4 separate tasks at pretty much full speed.
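
A rough sketch of the "one task per node" approach on Linux, assuming you already know which CPUs belong to which node. The layout in the demo is hypothetical and `plan_tasks` / `run_pinned` are illustrative names, not an existing API. It relies on Linux's default policy of allocating memory on the node of the CPU that first touches it, so pinning the CPUs is usually enough to keep the memory local too.

```python
import os
import subprocess


def plan_tasks(commands, node_cpus):
    """Pair each command with one node's CPU set, round-robin over the nodes."""
    nodes = sorted(node_cpus)
    if not nodes:
        raise ValueError("need at least one NUMA node")
    return [
        (cmd, set(node_cpus[nodes[i % len(nodes)]]))
        for i, cmd in enumerate(commands)
    ]


def run_pinned(command, cpus):
    """Start `command` restricted to `cpus` (Linux only: uses sched_setaffinity)."""
    def pin():
        os.sched_setaffinity(0, cpus)  # pid 0 = the calling (child) process
    return subprocess.Popen(command, preexec_fn=pin)


if __name__ == "__main__":
    # Hypothetical 2-node layout; real ids come from /sys/devices/system/node.
    layout = {0: range(0, 8), 1: range(8, 16)}
    jobs = [["./task_a"], ["./task_b"]]  # placeholder commands
    for cmd, cpus in plan_tasks(jobs, layout):
        print(cmd, "->", sorted(cpus))
```

To actually launch the jobs, call `run_pinned(cmd, cpus)` for each pair from `plan_tasks` and `wait()` on the returned processes.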

The new WX processors are a little weird because two of the NUMA nodes have no direct access to RAM at all; they have to ask the other 2 dies to fetch it for them and pass it over.