| > Is this just a cost efficiency thing?

It's not entirely, but even that would be a justifiable reason. Tail behavior of all sorts matters a lot, and so do sophisticated congestion control and load balancing. ML training is all about massive collectives: a single tail-latency event in an NCCL collective means all GPUs in that group sit idle until the last GPU finishes.

> It only takes like 1 core to terminate 200 Gb/s of reliable bytestream using a software protocol with no hardware offload over regular old 1500-byte MTU ethernet.

The conventional TCP/IP stack is a lot more than just 20 GB/s of memcpys at 200 GbE: there's a DMA into kernel buffers and then a copy into user memory, there are syscalls and interrupts going back and forth, there's segmentation and checksums and reassembly and retransmits, and overall a lot more work. RDMA takes all that off the host CPU.

> all you need is a parallel hardware crypto accelerator
> all you need is a hardware copy/DMA engine

And when you add these and all the other requirements you get a modern RDMA network :). The network is what kicks in when Moore's law recedes. Jensen Huang wants you to pretend that your 10,000 GPUs are one massive GPU: that only works if you have NVLink/InfiniBand or something in that league, and even then barely. And GOOG/MSFT/AMZN are too big and the datacenter fabric is too precious to be outsourced. |
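To put rough numbers on the tail-latency point: a synchronous collective finishes when its slowest rank does, so the chance of a slow step grows quickly with group size. The sketch below is only a toy model. It assumes each rank independently hits a tail event with probability `p_tail`, and the names `straggler_probability`, `collective_time`, and `simulate_step` are made up for illustration, as are the latency values.

```python
import random


def straggler_probability(p_tail: float, n_gpus: int) -> float:
    """Chance that at least one of n_gpus independent ranks hits a tail event."""
    return 1.0 - (1.0 - p_tail) ** n_gpus


def collective_time(per_rank_times):
    """A synchronous collective completes only when the slowest rank does."""
    return max(per_rank_times)


def simulate_step(n_gpus, base=1.0, tail=10.0, p_tail=0.001, rng=None):
    """Draw one step: each rank takes `base` time, or `tail` with prob p_tail.

    The base/tail values are arbitrary illustrative units, not measurements.
    """
    rng = rng or random.Random(0)
    times = [tail if rng.random() < p_tail else base for _ in range(n_gpus)]
    return collective_time(times)


if __name__ == "__main__":
    for n in (8, 1_000, 10_000):
        print(n, round(straggler_probability(0.001, n), 4))
```

Under these assumptions, a 1-in-1,000 per-rank tail event hits about 63% of steps at 1,000 ranks and nearly every step at 10,000, which is why tail behavior dominates at scale even when the median looks fine.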
Hardware crypto acceleration and a hardware memory copy engine do not constitute an RDMA engine. The API I am describing is the receiver programming into a device an (address, length) chunk of data to decrypt and a (src, dst, length) chunk of data to move, respectively. That is a far cry from a whole hardware network protocol.
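To show how small that surface is, here is a rough software sketch of the two descriptors. Everything in it is hypothetical: `DecryptDescriptor`, `CopyDescriptor`, and `ToyOffloadDevice` are illustrative names, not any real device's interface, memory is modeled as one flat `bytearray`, and a single-byte XOR stands in for a real cipher purely so the example runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecryptDescriptor:
    address: int  # start of the ciphertext in device-visible memory
    length: int   # number of bytes to decrypt in place


@dataclass(frozen=True)
class CopyDescriptor:
    src: int      # source offset
    dst: int      # destination offset
    length: int   # number of bytes to move


class ToyOffloadDevice:
    """Software stand-in for the two engines (hypothetical, not a real API)."""

    def __init__(self, memory: bytearray, key: int):
        self.memory = memory
        self.key = key & 0xFF

    def _check(self, start: int, length: int) -> None:
        if start < 0 or length < 0 or start + length > len(self.memory):
            raise ValueError("descriptor outside device memory")

    def decrypt(self, d: DecryptDescriptor) -> None:
        self._check(d.address, d.length)
        end = d.address + d.length
        # XOR is a placeholder only; a real engine would run a real cipher.
        self.memory[d.address:end] = bytes(
            b ^ self.key for b in self.memory[d.address:end]
        )

    def copy(self, d: CopyDescriptor) -> None:
        self._check(d.src, d.length)
        self._check(d.dst, d.length)
        self.memory[d.dst:d.dst + d.length] = self.memory[d.src:d.src + d.length]


def demo() -> bytes:
    """Receiver-side flow: decrypt a received chunk, then move it."""
    mem = bytearray(64)
    mem[0:5] = bytes(b ^ 0x5A for b in b"hello")  # "ciphertext" as received
    dev = ToyOffloadDevice(mem, key=0x5A)
    dev.decrypt(DecryptDescriptor(address=0, length=5))
    dev.copy(CopyDescriptor(src=0, dst=32, length=5))
    return bytes(mem[32:37])
```

Note what is absent: no connection state, no sequencing, no retransmission, no congestion control. In this sketch all of that stays in software on the receiver, which is the distinction being drawn from a full hardware network protocol.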