So why is it important to have a multisocket NUMA machine? Why not just save yourself a lot of hassle by having one socket? I know that the previous generation AMD machine had unavoidable NUMA but the new one doesn't.
This talk is about Zen+ Epyc, not Zen2 (which is where the non-cache memory gets uniform). I don't know if they have release-quality Epyc 7002 (Zen2) samples available yet, and if they do, NFLX probably isn't allowed to publish benchmarks about them. There's almost certainly still some value in their existing NUMA work even on Zen2, as things like L1/L2/L3 cache have locality even if memory and PCIe do not.
Pretty sure Intel single socket of this generation is totally non-viable for this workload due to lack of PCIe lanes. Maybe viable when Intel gets gen4 PCIe.
Skylake-X has 44 PCI 3.0 lanes, that's 352GT/s or about 345gbps application bandwidth. It's certainly more than enough to push 100gbps from disk to net. These guys are pushing 200gbps, but they're doing it with two CPUs, two sets of NVMe devices, and two NICs, and a bunch of hacks to make the operating system pretend all this stuff is not in the same box. It seems way more straight-forward to me if they had made it all be actually NOT in the same box!
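For anyone who wants to check the lane arithmetic, here is a rough sketch in Python. It assumes the standard PCIe 3.0 figures (8 GT/s per lane, 128b/130b line encoding) and ignores packet and protocol overhead, so real application throughput would come in somewhat lower. The function and constant names are just illustrative.

```python
# Back-of-the-envelope PCIe 3.0 bandwidth for a given lane count.
# Assumptions: 8 GT/s per lane and 128b/130b encoding (PCIe 3.0);
# TLP/DLLP protocol overhead is NOT modelled here.
GT_PER_LANE = 8.0        # giga-transfers per second, per lane
ENCODING = 128 / 130     # fraction of transferred bits that carry payload


def pcie3_bandwidth(lanes):
    """Return (raw GT/s, Gbps after line encoding) for `lanes` lanes."""
    raw = lanes * GT_PER_LANE
    usable = raw * ENCODING
    return raw, usable


raw, usable = pcie3_bandwidth(44)
print(f"raw: {raw:.0f} GT/s, after encoding: {usable:.1f} Gbps")
```

That lands within a couple of Gbps of the 345 figure above, before any protocol overhead is taken off.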
> Skylake-X has 44 PCI 3.0 lanes, that's 352GT/s or about 345gbps application bandwidth. It's certainly more than enough to push 100gbps from disk to net. These guys are pushing 200gbps
We're in total agreement :-). Their dataflow model requires something like 2x the served bandwidth in PCIe bandwidth and 4x in memory bandwidth in the optimal case, as covered in the slides. 2 x 200 Gbps = 400 Gbps, which is a bit more than 345 Gbps.
Maybe they could push 345/2 = 172 Gbps out of a single Skylake-X, best case. For some workloads, that might be the right local optimum! They must have decided that the marginal cost of a 2P system was worth the extra ~25 Gbps to saturate the 200 Gbps pipe fully.
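The same budget as a tiny Python sketch, in case it helps. It takes the 345 Gbps figure from upthread as given and assumes each served bit crosses PCIe twice (my reading of the "2x" in the slides: once from NVMe into memory, once from memory out to the NIC). The names are made up for illustration.

```python
# Rough single-socket serving budget, under stated assumptions.
PCIE_BUDGET_GBPS = 345.0  # figure quoted upthread for 44 PCIe 3.0 lanes
PCIE_CROSSINGS = 2        # assumption: NVMe -> RAM, then RAM -> NIC
TARGET_GBPS = 200.0       # the pipe they want to saturate


def max_served_gbps(pcie_budget, crossings=PCIE_CROSSINGS):
    """Best-case served bandwidth if PCIe is the only bottleneck."""
    return pcie_budget / crossings


best_case = max_served_gbps(PCIE_BUDGET_GBPS)
shortfall = TARGET_GBPS - best_case
print(f"best case: {best_case} Gbps, short of target by {shortfall} Gbps")
```

This only models PCIe; the 4x memory bandwidth requirement could well bite first on a real box.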
> they're doing it with two CPUs, two sets of NVMe devices, and two NICs, and a bunch of hacks to make the operating system pretend all this stuff is not in the same box. It seems way more straight-forward to me if they had made it all be actually NOT in the same box!
I've spoken with NFLX engineers in the past and my recollection is that in many installations, NFLX only gets to install one box. (Or something like that. Might just be a cost thing.) So they need to make that one box fast.
I guess the other factor is the IP management overhead discussed in the slides. Two boxes necessitate the costly 2nd IP, as far as I know. It's hard to imagine the cost of an IP address dominating the marginal cost of a 2P system and 2nd Xeon, but I guess AWS is friggin expensive.
For the typical case, yeah, you don't want communication.
But if CPU#1 wants to access a file that is on CPU#2's NVMe drives, NUMA allows you to share those files across memory (and it's a "local" file according to the OS), instead of over NFS or SMB.
--------
And yes, as much as we like to pretend that there's no communication and everything scales horizontally... in practice... people like sharing files between systems. NUMA allows for these files (and other resources: such as PCIe network cards or GPUs) to be shared between systems at the speed of DDR4 memory.
One bigger host means less management overhead. (See other comments about being unwilling to run multiple IPs and whatnot.)
It may be a bit early to have a presentation on the latest Epyc processors. Most of the work was likely done with previous processors, but their slides said their AMD boxes are single socket.