Interestingly, the fastest CPU-based network switches tend to do full kernel bypass. The kernel network stack is generally slow compared to OVS and VPP, especially when those run on top of something like DPDK.
Kernel bypass in DPDK grants the application direct access to DMA buffers so that the kernel is no longer involved. This is not because the kernel is slow, but because many small syscalls are expensive and putting your entire app in the kernel is a bad idea.
There is no kernel bypass in wireguard-go, just a fast user-space implementation with smart use of syscalls to minimize the overhead of being split between user space and kernel space.
With io_uring, DPDK-style kernel bypass might stop making sense altogether.
It depends on what you are trying to do though. I don’t think the kernel has an easy path to operating on a set of packet headers as a vector at this point. Not saying it can’t happen, but it’s an area where user space is already ahead.
For reference, a previous test demonstrated 40 Gbps of IPsec between two pods on separate nodes in k8s; the encap/decap kept up with 40 Gbps, which was the line rate of the Intel NICs used.
I do agree that io_uring will negate the need for DPDK for many use cases though; it will likely be a much simpler and more secure path than DPDK.
It's not that the kernel is slow; left to its own devices, the kernel is plenty fast. The reason is that when you want to make decisions about a packet in userspace (vs. telling the kernel what to do with it via various interfaces), that kernel logic would just be overhead.
It's similar for applications: if you can, say, decode a whole DNS packet in one go, you don't really want the kernel to spend time decoding the UDP packet and then decode the rest yourself; doing it in one step is much faster.
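To make the "decode it in one go" point concrete, here is a rough Go sketch that walks IPv4, UDP and the DNS header in a single pass over a raw packet buffer. The names (`dnsSummary`, `parseDNSOverUDP`, `samplePacket`) are made up for illustration, and it assumes the buffer starts at the IPv4 header with no fragmentation; it skips checksum validation entirely.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// dnsSummary holds the few fields a hypothetical fast path cares about.
type dnsSummary struct {
	SrcPort, DstPort uint16
	ID               uint16 // DNS transaction ID
	Questions        uint16 // DNS QDCOUNT
}

var errBadPacket = errors.New("packet too short or not IPv4/UDP")

// parseDNSOverUDP reads the IPv4, UDP and DNS headers in one pass,
// without handing the buffer to separate per-layer decoders.
// Assumption: pkt starts at the IPv4 header and is not a fragment.
func parseDNSOverUDP(pkt []byte) (dnsSummary, error) {
	if len(pkt) < 20 || pkt[0]>>4 != 4 {
		return dnsSummary{}, errBadPacket
	}
	ihl := int(pkt[0]&0x0f) * 4 // IPv4 header length in bytes
	// Protocol 17 is UDP; require 8 bytes of UDP and 12 bytes of DNS header.
	if ihl < 20 || pkt[9] != 17 || len(pkt) < ihl+8+12 {
		return dnsSummary{}, errBadPacket
	}
	udp := pkt[ihl:]
	dns := udp[8:]
	return dnsSummary{
		SrcPort:   binary.BigEndian.Uint16(udp[0:2]),
		DstPort:   binary.BigEndian.Uint16(udp[2:4]),
		ID:        binary.BigEndian.Uint16(dns[0:2]),
		Questions: binary.BigEndian.Uint16(dns[4:6]),
	}, nil
}

// samplePacket builds a minimal IPv4+UDP+DNS header for demonstration only.
func samplePacket() []byte {
	pkt := make([]byte, 20+8+12)
	pkt[0] = 0x45 // version 4, IHL 5 (20 bytes)
	pkt[9] = 17   // protocol: UDP
	binary.BigEndian.PutUint16(pkt[20:], 40000)  // UDP source port
	binary.BigEndian.PutUint16(pkt[22:], 53)     // UDP destination port
	binary.BigEndian.PutUint16(pkt[28:], 0xbeef) // DNS ID
	binary.BigEndian.PutUint16(pkt[32:], 1)      // DNS QDCOUNT
	return pkt
}

func main() {
	s, err := parseDNSOverUDP(samplePacket())
	fmt.Println(s, err)
}
```

A real fast path would also have to deal with IP options, fragments and checksums; the sketch only shows that all three headers can be read from one buffer in one function.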
There are some applications where the ability to vectorize the headers and operate on them with SIMD helps. These types of apps tend to pin a full core to do only packet processing though. Also, syscalls are expensive. A lot of work is going into making the APIs async while avoiding syscalls.
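As a rough illustration of what "operating on a set of headers as a vector" looks like, here is a Go sketch that keeps one header field per slice (structure of arrays) and processes a whole batch in a tight loop. Go has no portable SIMD intrinsics, so this only shows the data layout that vectorized code relies on; `headerBatch` and `decrementTTL` are hypothetical names, not anything from VPP or DPDK.

```go
package main

import "fmt"

// headerBatch stores one header field for a whole batch of packets
// in a contiguous slice, instead of one struct per packet.
type headerBatch struct {
	TTL []uint8
}

// decrementTTL lowers the TTL of every packet in the batch by one.
// keep[i] is false when packet i would expire (TTL 0 or 1) and
// should be dropped; its TTL is left untouched in that case.
func decrementTTL(b *headerBatch) []bool {
	keep := make([]bool, len(b.TTL))
	for i, ttl := range b.TTL {
		if ttl > 1 {
			b.TTL[i] = ttl - 1
			keep[i] = true
		}
	}
	return keep
}

func main() {
	b := &headerBatch{TTL: []uint8{64, 1, 0, 2}}
	keep := decrementTTL(b)
	fmt.Println(b.TTL, keep)
}
```

The point of the layout is that the same operation touches adjacent bytes for every packet, which is the shape a SIMD implementation in C or assembly would want.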
By their nature as L2/L3 devices, I wouldn't expect switches to ever support WireGuard. I also haven't heard of any hardware WireGuard yet. The fastest implementation so far might be TNSR, which just squeaks in under $2,000.
Really depends on what you consider a “switch” to be. Most of Mikrotik’s CRS series supports full-fat RouterOS, which includes WireGuard support. The CPU on the CRS line is much cheaper than on the proper routers (CCR series) though, so if you’re trying to do much more than a basic firewall and NAT on a residential connection (most of them can probably handle 1 Gbps fine), performance will not be great (even my CCR2004 can only handle ~3 Gbps of IPsec traffic).