| HN Mirror


by Teknoman117 21 days ago

Gen 4.0 x16 is 32 GB/s in each direction, but the way this is implemented is not the way you'd go about this if you wanted high performance.
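For reference, the 32 GB/s figure can be reproduced from the link parameters. This is only a rough sketch: it assumes the standard PCIe 4.0 rate of 16 GT/s per lane with 128b/130b encoding and ignores packet (TLP) and flow-control overhead, so real payload throughput is lower. The function and constant names are made up for illustration.

```python
# Back-of-the-envelope PCIe 4.0 x16 bandwidth, per direction.
# Assumptions: 16 GT/s per lane, 128b/130b line encoding,
# no packet header or flow-control overhead counted.
TRANSFERS_PER_S_PER_LANE = 16e9   # 16 GT/s, one bit per transfer per lane
LANES = 16
ENCODING_EFFICIENCY = 128 / 130   # 128 payload bits per 130 bits on the wire


def pcie_bytes_per_second(transfers_per_s=TRANSFERS_PER_S_PER_LANE,
                          lanes=LANES,
                          efficiency=ENCODING_EFFICIENCY):
    """Return the usable link bandwidth in bytes per second, one direction."""
    bits_per_second = transfers_per_s * lanes * efficiency
    return bits_per_second / 8


if __name__ == "__main__":
    print(f"{pcie_bytes_per_second() / 1e9:.1f} GB/s per direction")
```

That works out to roughly 31.5 GB/s each way, which is where the rounded "32 GB/s" (and the combined "64 GB/s") comes from.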

Edit: Their benchmarks are also run using ZRAM, which compresses pages before writing to swap. Not sure what the performance overhead of that is, but it's probably quite a bit.

First of all, it's a userspace program hooking the nbd driver, which is known for being slow. It also uses a bounce buffer in userspace before transferring to the GPU. So when the kernel needs to swap a page, it first has to copy it into a userspace-facing buffer. The userspace program then has to wake back up and issue the CUDA operation to copy the page into device memory.

nbd also doesn't really do a good job of supporting high queue depth or merging adjacent accesses. So if the kernel is issuing a bunch of 4K page swaps without any coalescing, you're going to end up with at least a million kernel/userspace context switches per second just to handle 4 GB/s (4 GB/s divided by 4K per page), let alone 64 GB/s. And that's just the nbd portion, forget the mess that is the NVIDIA driver. PCIe can move a lot of data, but to get anything even resembling the full bandwidth, you have to use DMA engines with long page lists. Setting up a separate transfer for every 4K page over PCIe will not saturate the bus.
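The request-rate arithmetic above can be sketched as follows. It assumes 4 KiB pages and treats each request as costing at least one kernel/userspace round trip; the `pages_per_request` parameter is a hypothetical knob to show what merging adjacent pages would buy, not something nbd exposes.

```python
# Requests per second needed to sustain a given swap bandwidth.
# Assumptions: 4 KiB pages; one request = at least one kernel/userspace
# round trip. pages_per_request models (hypothetical) request merging.
PAGE_SIZE = 4096


def requests_per_second(bandwidth_bytes_per_s, pages_per_request=1,
                        page_size=PAGE_SIZE):
    """Return how many requests per second the given bandwidth implies."""
    return bandwidth_bytes_per_s / (pages_per_request * page_size)


if __name__ == "__main__":
    for gb_per_s in (4, 32, 64):
        bandwidth = gb_per_s * 1e9
        per_page = requests_per_second(bandwidth)
        merged = requests_per_second(bandwidth, pages_per_request=32)
        print(f"{gb_per_s:>2} GB/s: {per_page:>12,.0f} req/s unmerged, "
              f"{merged:>10,.0f} req/s with 32-page requests")
```

At 4 GB/s this gives just under a million requests per second with no merging, and about 15.6 million at 64 GB/s, which is why coalescing and long DMA page lists matter so much.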

Swapping to NVMe is a very optimized path: the swapper can submit lists of pages directly to the NVMe driver, and the controller can DMA them directly out of RAM, with no copies or context switches on the CPU side at all.

This could probably be improved by migrating to the ublk driver as it might let you avoid the userspace bounce buffer. It'd also be able to have multiple write queues to at least set up CUDA copies in parallel.

2 comments

tumblestick 20 days ago

It's true that the Linux kernel is the throughput bottleneck. Unfortunately, the optimizations described above aren't sufficient to get within even 10% of hardware bandwidth.

Even if the swap system overhead drops to just a data copy, the memory management layer prevents swap from scaling to higher bandwidths. The issue is not data movement; it is in the page unmapping step (which requires expensive TLB shootdowns). Larger kernel changes are required.

My group wrote a paper on this: https://dl.acm.org/doi/10.1145/3731569.3764842

Linux's swap system has been undergoing some large refactors lately. Hopefully some insights from either our work or Hermit (NSDI '23) can make it into mainline. I think Hermit's `rmap` optimization in particular is a candidate for upstream use.


lstodd 21 days ago

yup. it's nbd and userspace making it slow. zram on the other hand adds little.

one can get rid of zram and just reimplement some compression in shaders but I think that would be a pointless optimization.
