by Osiris 2872 days ago
There are 2 dies in the 1950X, and each one has 2 memory channels. Thus, it's possible to run a process on one (8-core) die that maxes out the memory bandwidth to its two local DDR4 channels while the other die still has full-bandwidth access to its own DDR4 channels.

Threadripper is able to switch between NUMA (non-uniform memory access) mode and "regular" mode. In NUMA mode, the OS knows that 2 channels are attached to one die and 2 channels to the other, which allows lower latencies because the OS can allocate RAM local to whichever core the process is running on.
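
To see what the OS actually knows here, a rough sketch (Linux only, and assuming the kernel exposes the usual `/sys/devices/system/node/nodeN/cpulist` files) that prints which CPUs belong to which NUMA node. The function names `parse_cpulist` and `numa_topology` are made up for this example, and the `0-7,16-23` string in the docstring is just a sample of the format, not a claim about any particular chip.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-7,16-23' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_topology(root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]}; empty dict if no NUMA info is exposed."""
    base = Path(root)
    topology = {}
    if not base.is_dir():
        return topology
    for node_dir in base.glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            node_id = int(node_dir.name[len("node"):])
            topology[node_id] = parse_cpulist(cpulist.read_text())
    return topology


if __name__ == "__main__":
    for node, cpus in sorted(numa_topology().items()):
        print(f"node {node}: CPUs {cpus}")
```

In NUMA mode you would expect two nodes to show up; in "regular" mode the kernel should report a single node holding every CPU.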


As a bonus, if you run explicitly in NUMA mode and the OS/code does a good job, there's little line contention or resource sharing (e.g., caches) between the dies.
I found a significant performance benefit to keeping NUMA turned on when running Linux, for basically every workload.

For Windows, it is the other way around. I hope they'll improve their NUMA handling, but I'm not holding my breath.

The Linux kernel is clever about this. You can get some idea of what it does by looking at numactl, which lists the various scheduling modes -- though in practice the kernel does a great job without any user overrides, and actually using the command is likely to slow things down.

Which is not to say that it can't occasionally be helpful, if you're trying to optimize the speed of a single thread. At a minimum, you can choose between optimizing for bandwidth (interleaving data on all four memory channels) or latency (putting everything in the local node). Usually you want the latter.
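
For anyone who wants to try both, a small sketch of how the two choices map onto `numactl` options. The helper name `numactl_command` and the `./bench` program are hypothetical; the flags themselves (`--interleave`, `--cpunodebind`, `--localalloc`) are standard `numactl` options. Node 0 as the default is an arbitrary assumption.

```python
import shlex


def numactl_command(goal, program, node=0):
    """Build a numactl argv for one of two goals.

    goal="bandwidth": interleave pages across all nodes (all memory channels).
    goal="latency":   run on one node's CPUs and allocate from the local node.
    """
    if goal == "bandwidth":
        policy = ["--interleave=all"]
    elif goal == "latency":
        policy = [f"--cpunodebind={node}", "--localalloc"]
    else:
        raise ValueError(f"unknown goal: {goal!r}")
    # "--" separates numactl's own options from the wrapped command.
    return ["numactl", *policy, "--", *program]


if __name__ == "__main__":
    # ./bench is a placeholder for whatever you want to measure.
    for goal in ("bandwidth", "latency"):
        print(shlex.join(numactl_command(goal, ["./bench"])))
```

The returned list can be passed straight to `subprocess.run`. Whether either variant beats the kernel's default placement is something to measure rather than assume.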

Does it really work this way (with automatic memory and core pinning)? Both Windows and Linux can do that?
Linux does that. Windows...

Judging by the performance I'm (not) getting, Windows does a very poor job with NUMA.

Does Linux do that OOB or do I need to do some configuration?
I never configured anything.