| HN Mirror



	by bt848 2457 days ago
	So why is it important to have a multisocket NUMA machine? Why not just save yourself a lot of hassle by having one socket? I know that the previous generation AMD machine had unavoidable NUMA but the new one doesn't.

7 comments

loeg 2457 days ago

This talk is about Zen+ Epyc, not Zen2 (which is where the non-cache memory gets uniform). I don't know if they have release-quality Epyc 7002 (Zen2) samples available yet, and if they do, NFLX probably isn't allowed to publish benchmarks about them. There's almost certainly still some value in their existing NUMA work even on Zen2, as things like L1/L2/L3 cache have locality even if memory and PCIe do not.

Pretty sure Intel single socket of this generation is totally non-viable for this workload due to lack of PCIe lanes. Maybe viable when Intel gets gen4 PCIe.


bt848 2457 days ago

Skylake-X has 44 PCIe 3.0 lanes; that's 352 GT/s, or about 345 Gbps of application bandwidth. It's certainly more than enough to push 100 Gbps from disk to net. These guys are pushing 200 Gbps, but they're doing it with two CPUs, two sets of NVMe devices, two NICs, and a bunch of hacks to make the operating system pretend all this stuff is not in the same box. It seems way more straightforward to me if they had made it all actually NOT be in the same box!
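
For anyone checking the arithmetic, here is a rough sketch in Python. It assumes PCIe 3.0's 8 GT/s per lane and 128b/130b line encoding; the function name is just illustrative. It ignores packet and protocol overhead, so real application throughput lands somewhat below the encoded figure.

```python
# PCIe 3.0 signals at 8 GT/s per lane and uses 128b/130b line encoding.
GT_PER_LANE = 8.0
ENCODING_EFFICIENCY = 128 / 130


def pcie3_bandwidth_gbps(lanes):
    """Return (raw GT/s, Gbps after line encoding) for `lanes` PCIe 3.0 lanes.

    Per direction only; packet headers and flow-control overhead are not modeled.
    """
    raw = lanes * GT_PER_LANE
    usable = raw * ENCODING_EFFICIENCY
    return raw, usable


raw, usable = pcie3_bandwidth_gbps(44)
print(f"raw: {raw:.0f} GT/s, after encoding: {usable:.1f} Gbps")
```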


loeg 2457 days ago

> Skylake-X has 44 PCIe 3.0 lanes; that's 352 GT/s, or about 345 Gbps of application bandwidth. It's certainly more than enough to push 100 Gbps from disk to net. These guys are pushing 200 Gbps

We're in total agreement :-). Their dataflow model requires something like 2x that in PCIe bandwidth and 4x in memory bandwidth in the optimal case, as covered in the slides. 2 x 200 Gbps = 400 Gbps, which is a bit more than 345 Gbps.

Maybe they could push 345/2 = ~172 Gbps out of a single Skylake-X, best case. For some workloads, that might be the right local optimum! They must have decided that the marginal cost of a 2P system was worth the extra ~25 Gbps to saturate the 200 Gbps pipe fully.
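
A back-of-the-envelope version of that estimate, as a sketch only: it takes the ~345 Gbps PCIe figure and the "2x" PCIe multiplier from this thread as given, and the function name and default are made up for illustration.

```python
def max_egress_gbps(pcie_gbps, pcie_multiplier=2):
    """Upper bound on served traffic if each served bit costs
    `pcie_multiplier` bits of PCIe bandwidth (assumed 2x, per the thread)."""
    return pcie_gbps / pcie_multiplier


target = 200                            # Gbps, the pipe they want to saturate
single_socket = max_egress_gbps(345)    # best case for one Skylake-X
shortfall = target - single_socket      # what the second socket has to cover

print(f"single socket ceiling: {single_socket} Gbps, shortfall: {shortfall} Gbps")
```

With these inputs the gap comes out nearer 27 Gbps than 25, which is still in the same ballpark.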

> they're doing it with two CPUs, two sets of NVMe devices, two NICs, and a bunch of hacks to make the operating system pretend all this stuff is not in the same box. It seems way more straightforward to me if they had made it all actually NOT be in the same box!

I've spoken with NFLX engineers in the past and my recollection is that in many installations, NFLX only gets to install one box. (Or something like that. Might just be a cost thing.) So they need to make that one box fast.

I guess the other factor is the IP management overhead discussed in the slides. Two boxes necessitate the costly 2nd IP, as far as I know. It's hard to imagine the cost of an IP address dominating the marginal cost of a 2P system and 2nd Xeon, but I guess AWS is friggin expensive.


drewg123 2457 days ago

I've run this particular AMD system in 3 ways:
- Non-NUMA
- 2 nodes per socket
- 4 nodes per socket

The 4NPS gives the best performance, followed by 2NPS, followed by non-NUMA. This surprised me as well.


baolongtrann 2457 days ago

It's money. Having one two-socket machine instead of two one-socket machines, even when performance is only 80%, saves both OpEx and CapEx.


dragontamer 2457 days ago

Communication over DDR4 is way faster than communication over PCIe Ethernet. Even 40Gbit is slow compared to RAM.


bt848 2457 days ago

Maybe, but this post is about making the two sides of the computer NOT communicate.


dragontamer 2457 days ago

For the typical case, yeah, you don't want communication.

But if CPU#1 wants to access a file that is on CPU#2's NVMe devices, NUMA allows you to share those files across memory (and it's a "local" file according to the OS), instead of over NFS or SMB.

--------

And yes, as much as we like to pretend that there's no communication and everything scales horizontally... in practice... people like sharing files between systems. NUMA allows for these files (and other resources, such as PCIe network cards or GPUs) to be shared between systems at the speed of DDR4 memory.


kylek 2457 days ago

Density/power/cooling are real constraints when you aren't 100% bought-in to the cloud.


toast0 2457 days ago

One bigger host means less management overhead. (See other comments about being unwilling to run multiple IPs and whatnot.)

It may be a bit early to have a presentation on the latest Epyc processors. Most of the work was likely done with previous processors, but their slides said their AMD boxes are single socket.


o-__-o 2457 days ago

Something something performant concurrency
