by drewg123 3397 days ago
The important thing here, from my perspective, is how NUMA-ish a single-socket configuration will be. According to the article, a single package is actually made up of 4 dies, each with its own memory (and presumably cache hierarchy, etc.). While trivially parallelizable workloads (like HPC benchmarks) scale quite well regardless of system topology, not all workloads do. And teaching kernel schedulers about two levels of NUMA affinity may not be trivial.

With that said, I'm looking forward to these systems.


Intel's largest CPUs are already explicitly NUMA on a single socket. They call it Cluster On Die: http://images.anandtech.com/doci/10401/03%20-%20Architectura...
Very true, I should have mentioned that. At least for us, COD doesn't seem to impact our performance at all, while NUMA does. I'm hoping that Naples is the same for us.

However, there is an important difference. AMD seems to be putting multiple dies into the same package, whereas Intel seems to have (as the Cluster on Die name implies) everything on the same die. So my fear is that the interconnect between dies may not be fast enough to paper over our NUMA weaknesses.

Sounds like your application is latency sensitive rather than bandwidth sensitive. Take a look at the graphs towards the end of this article:

https://www.starwindsoftware.com/blog/numa-and-cluster-on-di...

There's not much difference in memory bandwidth between crossing domains on the same die (COD) and crossing domains system-wide (accessing memory attached to a different socket). What kind of computation are you running?

I'm talking about Netflix CDN servers. The workload is primarily file serving. The twist is that we use a non-NUMA-aware OS (FreeBSD).

We're not latency sensitive at all. The problem we run into with NUMA is that we totally saturate QPI due to FreeBSD's lack of NUMA awareness.
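
To put a rough number on why a NUMA-unaware OS leans so hard on the interconnect, here is a back-of-the-envelope sketch. It assumes memory and the threads or devices touching it are placed uniformly at random across domains, which is a simplification and not a measurement of any real system. The function name `remote_fraction` and the domain counts are hypothetical, chosen only for illustration.

```python
def remote_fraction(num_domains: int) -> float:
    """Expected share of memory accesses that cross to another NUMA domain.

    Assumption (not from any measurement): data and the accessing
    thread/device each land on a domain uniformly at random, so an access
    is local with probability 1/num_domains.
    """
    if num_domains < 1:
        raise ValueError("num_domains must be at least 1")
    return (num_domains - 1) / num_domains


# Hypothetical topologies: a 2-socket box, and 2 sockets x 4 dies each.
for domains in (1, 2, 8):
    print(f"{domains} domain(s): {remote_fraction(domains):.1%} remote")
```

Under that assumption, about half of all accesses on a two-domain machine cross the interconnect, and the share climbs as the number of domains grows, which is why more dies per package is a concern without NUMA-aware placement.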

The results you link to don't match with what we've seen on our HCC Broadwell CPUs, at least with COD disabled. Though we only really look at aggregate system bandwidth, so potentially the slowness accessing the "far" memory on the same socket is latency driven, and falls away in aggregate.