| Now do that across thousands of connections, while retaining very low p99 latency. Just the idea that using a bytestream is OK is leaving opportunity on the table. If you know what protocols you are sending, you can allow some out-of-order transmission. Asking the kernel or DPDK or whatever to juggle contention sounds like a coherency nightmare on a large-scale system; it is a very hard scheduling problem that a hardware timing wheel is going to be able to just do. Getting reliability & stability at massive concurrency at low latencies feels like such an obvious place for hardware to shine, and it does here. Maybe you can dedicate some cores of your system to maintaining a low-enough-latency simulacrum, but you'd still have to shuffle all the data through those low-latency cores, which itself takes time and system bandwidth. Leaving the work to hardware with its own buffers & its own scheduling seems like an obviously good use of hardware. Especially with the incredibly exact delay-based congestion control their close cycle-timing feedback gives them: you can act way before the CPU would poll/interrupt again. Then having their own Upper-Level-Protocol processors offloads a ton more of the hard work these applications need. You don't seem curious or interested at all. You seem like you are here to put down and belittle. There are so many amazing wins in so many dimensions here, where the NIC can do very smart things, can specialize, and can respond with enormous speed. I'd challenge you to try just a bit to see some upsides to specialization, versus just saying a CPU hypothetically can do everything (and where is the research showing what p99 latency the best-of-breed software stacks can do?). |
However, people like yourself talk about these hardware stacks as if they have clear advantages in performance, latency, and isolation. They make uncurious and dismissive comments, asserting without evidence that results at this level are only achievable with dedicated hardware.
The only consistent conclusion I can come up with is that everybody just uses really bad software stacks, which makes these dedicated hardware solutions look like major improvements when they are just demonstrating the performance you should expect out of your software stack. The fact that this is considered a serious improvement over RoCE, which is itself viewed as a serious improvement over things like the Linux kernel TCP software stack, lends support to my conclusion.
I make comments on various posts about network protocols to see if I am missing something about the problem space that actually makes it hard to do efficiently in a software protocol. Mostly I just get people parroting the claim that a performant software solution is impossible because of easily solved problems like loss/recovery/retransmission, instead of actually pointing to the hard parts of the problem.
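For concreteness, the kind of loss/retransmission bookkeeping I am talking about is roughly the following. This is only a sketch: `RetransmitTracker`, its method names, and the fixed timeout are all made up for illustration, and it leaves out RTT estimation, congestion control, and anything to do with scale.

```python
import heapq


class RetransmitTracker:
    """Hypothetical sender-side bookkeeping for timeout-based retransmission."""

    def __init__(self, rto: float):
        self.rto = rto          # fixed retransmission timeout (assumed, not adaptive)
        self.unacked = {}       # seq -> payload still awaiting an ACK
        self.deadlines = []     # min-heap of (deadline, seq)

    def on_send(self, seq: int, payload: bytes, now: float) -> None:
        # Remember the payload and arm a timer for it.
        self.unacked[seq] = payload
        heapq.heappush(self.deadlines, (now + self.rto, seq))

    def on_ack(self, seq: int) -> None:
        # Acked packets are simply forgotten; stale timers are skipped lazily.
        self.unacked.pop(seq, None)

    def due(self, now: float) -> list:
        """Return sequence numbers whose timer expired while still unacked."""
        expired = []
        while self.deadlines and self.deadlines[0][0] <= now:
            _, seq = heapq.heappop(self.deadlines)
            if seq in self.unacked:
                expired.append(seq)
                # Re-arm the timer for the retransmitted packet.
                heapq.heappush(self.deadlines, (now + self.rto, seq))
        return expired
```

The caller would resend `unacked[seq]` for each sequence number returned by `due(now)`. Whether this stays cheap at thousands of connections with tight p99 targets is exactly the part under dispute, and the sketch does not settle it.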
And as for what would be useful hardware, I would go with a network with a full 64K MTU and hardware copy offload with HBM or another fast bus. Then you could pretty comfortably drive ~10 Tb/s per core, subject to enough memory/bus bandwidth.
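To put rough numbers on that, here is the back-of-the-envelope per-packet budget at 10 Tb/s for a few MTU sizes. It assumes "64K" means 65536 bytes and ignores header and framing overhead; `per_packet_budget` is just a helper name I made up.

```python
def per_packet_budget(link_bps: float, mtu_bytes: int) -> tuple[float, float]:
    """Return (packets per second, nanoseconds available per packet).

    Assumes every packet is exactly mtu_bytes and ignores header overhead.
    """
    pps = link_bps / (mtu_bytes * 8)
    return pps, 1e9 / pps


if __name__ == "__main__":
    link_bps = 10e12  # 10 Tb/s, the per-core figure from the comment above
    for mtu in (1500, 9000, 65536):
        pps, ns = per_packet_budget(link_bps, mtu)
        print(f"MTU {mtu:>6} B: {pps / 1e6:8.1f} Mpps, {ns:6.2f} ns per packet")
    # One pass over the data at this rate is 1.25 TB/s of memory traffic.
    print(f"Bytes per second per copy: {link_bps / 8:.3e}")
```

At a 64K MTU that works out to about 19 million packets per second, or roughly 52 ns of per-packet budget, versus about 1.2 ns per packet at a 1500-byte MTU. That is why the large MTU and the copy offload matter: the per-packet work gets room to breathe, and what is left is the 1.25 TB/s of data movement per copy.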