Hacker News
by ThePhysicist 1164 days ago
Pretty amazing that you can achieve such a throughput in a Golang userspace program. I wonder if other UDP based protocols like QUIC can attain those numbers as well.
4 comments

Interestingly, the fastest CPU based network switches tend to do full kernel bypass. The kernel is generally slow compared to OVS and VPP, especially when they traverse over something like DPDK.
Kernel bypass in DPDK grants the application direct access to DMA buffers so that the kernel is no longer involved. This is not because the kernel is slow, but because many small syscalls are expensive and putting your entire app in the kernel is a bad idea.

There is no kernel bypass in wireguard-go, just a fast user-space implementation with smart use of syscalls to minimize the overhead of being split between user-space and kernel-space.
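
To make the "fewer, bigger syscalls" idea concrete, here is a rough sketch of batching. It is not wireguard-go's code: `countingSink`, `WriteBatch` and `sendBatched` are hypothetical names, and the sink only counts calls, where each call stands in for one user/kernel boundary crossing. On Linux, a batching syscall such as sendmmsg(2) is one real facility with this shape; how wireguard-go actually batches is not stated in this thread.

```go
package main

import "fmt"

// countingSink is a hypothetical stand-in for a socket. Each WriteBatch
// call represents one crossing of the user/kernel boundary.
type countingSink struct {
	calls   int // number of simulated syscalls
	packets int // number of packets handed over in total
}

// WriteBatch accepts any number of packets in a single call.
func (s *countingSink) WriteBatch(pkts [][]byte) {
	s.calls++
	s.packets += len(pkts)
}

// sendBatched hands packets to the sink in groups of at most batchSize.
// batchSize must be greater than zero.
func sendBatched(s *countingSink, pkts [][]byte, batchSize int) {
	for start := 0; start < len(pkts); start += batchSize {
		end := start + batchSize
		if end > len(pkts) {
			end = len(pkts)
		}
		s.WriteBatch(pkts[start:end])
	}
}

func main() {
	// 1000 dummy packets of 1280 bytes each (sizes are arbitrary).
	pkts := make([][]byte, 1000)
	for i := range pkts {
		pkts[i] = make([]byte, 1280)
	}

	onePerCall := &countingSink{}
	sendBatched(onePerCall, pkts, 1)

	batched := &countingSink{}
	sendBatched(batched, pkts, 128)

	fmt.Println(onePerCall.calls, batched.calls) // prints "1000 8"
}
```

Same packets either way; the only thing that changes is how often the boundary is crossed, which is the cost the comment is talking about.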

With io_uring, DPDK-style kernel bypass might stop making sense altogether.

It depends on what you are trying to do though. I don’t think the kernel has an easy path to operating on a set of packet headers as a vector at this point. Not saying it can’t happen, but it’s an area where user space is already ahead.

For reference, a previous test demonstrated IPsec between two pods on separate nodes in k8s, where the encap/decap achieved 40 Gbps, which was the line rate for the Intel NICs used.

Details were published here: https://medium.com/fd-io-vpp/getting-to-40g-encrypted-contai...

I do agree that io_uring will negate the need for DPDK for many use cases though; it will likely be a much simpler and more secure path than DPDK.

It's not "kernel is slow"; the kernel, when left to its own devices, is plenty fast. The reason is that when you want to make decisions about a packet in userspace (vs telling the kernel what to do with it via various interfaces), that kernel logic would just be overhead.

It's similar for applications; if you can, say, decode a whole DNS packet in one go, you don't really want the kernel to spend time decoding the UDP packet and then you decoding the rest of the packet; doing it in one step is much faster.
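
A minimal sketch of the "one step" idea, assuming the application already holds the raw bytes starting at the UDP header (as it would with some form of kernel bypass). The type and function names are made up for illustration; the field offsets follow the standard 8-byte UDP header and 12-byte DNS header, and only a handful of fields are extracted.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	udpHeaderLen = 8  // src port, dst port, length, checksum
	dnsHeaderLen = 12 // ID, flags, and four 16-bit counts
)

// dnsOverUDP holds the few fields this sketch pulls out.
type dnsOverUDP struct {
	SrcPort, DstPort uint16
	ID               uint16
	IsResponse       bool
	Questions        uint16
}

// parseUDPDNS reads UDP and DNS header fields from one buffer in one pass,
// with no separate "strip the UDP header first" stage.
func parseUDPDNS(b []byte) (dnsOverUDP, error) {
	if len(b) < udpHeaderLen+dnsHeaderLen {
		return dnsOverUDP{}, errors.New("datagram too short")
	}
	dns := b[udpHeaderLen:]
	flags := binary.BigEndian.Uint16(dns[2:4])
	return dnsOverUDP{
		SrcPort:    binary.BigEndian.Uint16(b[0:2]),
		DstPort:    binary.BigEndian.Uint16(b[2:4]),
		ID:         binary.BigEndian.Uint16(dns[0:2]),
		IsResponse: flags&0x8000 != 0, // QR is the top bit of the flags word
		Questions:  binary.BigEndian.Uint16(dns[4:6]),
	}, nil
}

// sampleDatagram returns a hand-built query header: port 54321 to port 53,
// ID 0x1234, recursion desired, one question (question bytes omitted).
func sampleDatagram() []byte {
	return []byte{
		0xd4, 0x31, 0x00, 0x35, 0x00, 0x14, 0x00, 0x00, // UDP header
		0x12, 0x34, 0x01, 0x00, 0x00, 0x01, // DNS ID, flags, QDCOUNT
		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ANCOUNT, NSCOUNT, ARCOUNT
	}
}

func main() {
	h, err := parseUDPDNS(sampleDatagram())
	if err != nil {
		panic(err)
	}
	fmt.Println(h.SrcPort, h.DstPort, h.ID, h.IsResponse, h.Questions)
	// prints "54321 53 4660 false 1"
}
```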

There are some applications where the ability to vectorize the headers and operate on them with SIMD helps. These types of apps tend to pin a full core to do only packet processing though. Also, syscalls are expensive. A lot of work is going into making the APIs async while avoiding syscalls.
Are there consumer (<$2k) network switches that can do Wireguard in a very fast path?
By their nature as L2/L3 devices, I wouldn't expect switches to ever support Wireguard. I also haven't heard of any hardware Wireguard yet. The fastest implementation so far might be TNSR which just squeaks in under $2,000.
Really depends on what you consider a “switch” to be. Most of Mikrotik’s CRS series supports full fat RouterOS, which includes wireguard support. Though the CPU on the CRS line is much cheaper than the proper routers (CCR series), so if you’re trying to do much more than a basic firewall and NAT on a residential connection (can probably handle 1Gbps fine on most of them) performance will not be great (even my CCR2004 can only handle ~3Gbps of IPSEC traffic).
Right, I misspoke. I should know better. I intended to say "a network device".
go is pretty fast

in fact, i have a standing bet with some of my rustacean friends that they can't show me a typical HTTP service in rust, which has performance numbers (rps, latency, throughput) that i can't meet or beat in go

of course lots of caveats there, what does normal-ish mean, well probably most of the work is gonna be i/o bound, it should run on normal server-class hardware, et cetera et cetera

but nothing yet

For most compiled languages or languages with very good VMs like Java, benchmarks are really testing the quality of the implementation and the depth of the implementor's understanding.

I'd bet that very good Go and Rust programmers could probably converge to almost identical performance.

What I wouldn't bet on is that Go could equal Rust in the area of small memory footprint or on small devices.

> What I wouldn't be on is that Go could equal Rust in the area of small memory footprint or on small devices.

I haven't found a microcontroller that's too small for tinygo. I have even used time.Format(time.RFC3339) on one before. $1 spent on a microcontroller is the ultimate luxury these days.
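
For anyone unfamiliar with the call mentioned above: in standard Go, `Format` is a method on a `time.Time` value and `time.RFC3339` is a layout constant. A tiny sketch with an arbitrary fixed timestamp follows; `sampleStamp` is just an illustrative helper, and nothing here says anything about what TinyGo supports on a given board.

```go
package main

import (
	"fmt"
	"time"
)

// sampleStamp formats a fixed, arbitrary UTC instant as RFC 3339.
func sampleStamp() string {
	t := time.Date(2023, time.February, 1, 12, 30, 0, 0, time.UTC)
	return t.Format(time.RFC3339)
}

func main() {
	fmt.Println(sampleStamp()) // prints "2023-02-01T12:30:00Z"
}
```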

> I'd bet that very good Go and Rust programmers could probably converge to almost identical performance.

I'd imagine probably not, purely because Rust uses LLVM, which is VERY good at optimizing, while the Go compiler is simpler and made for speed of compilation first. If Go got LLVM frontend yeah, maybe

> What I wouldn't be on is that Go could equal Rust in the area of small memory footprint or on small devices.

Well, Go is GCed, which automatically makes it use at least a bit more memory, and every program also carries the GC code with it.

> If Go got LLVM frontend yeah, maybe

While Go probably won't get an official LLVM frontend, the TinyGo project [1] is trying to bring Go to embedded systems and it does use LLVM. Unfortunately I couldn't find any use for it in a project since it lacks so many features from mainline Go. Maybe I'll check back in a few years.

[1] https://tinygo.org/

unrelated but your username is solid gold
That's my experience as well. Recently I rewrote a Golang-based QUIC server in Rust and I had a hard time getting it to perform equally well. Certainly possible, but it requires a lot of hand-tuning and knowing exactly what you're doing. In Golang you just spawn a goroutine for each request and avoid lock-based shared state as much as possible and you're mostly good; the runtime will manage all aspects like number of threads, allocations etc. for you.
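
A rough sketch of that pattern, with made-up `request`/`response` types and a trivial `handle` function standing in for real work: one goroutine per request, results passed back over a channel instead of being written into mutex-protected shared state. This is only an illustration of the style, not code from the QUIC server mentioned above.

```go
package main

import (
	"fmt"
	"sync"
)

// request and response are hypothetical types for this sketch.
type request struct {
	ID      int
	Payload string
}

type response struct {
	ID   int
	Size int
}

// handle stands in for the real per-request work.
func handle(r request) response {
	return response{ID: r.ID, Size: len(r.Payload)}
}

// indexed pairs a response with the position of its request.
type indexed struct {
	i    int
	resp response
}

// serveAll starts one goroutine per request and gathers the results over a
// buffered channel, so no goroutine touches shared state under a lock.
func serveAll(reqs []request) []response {
	results := make(chan indexed, len(reqs))
	var wg sync.WaitGroup
	for i, r := range reqs {
		wg.Add(1)
		go func(i int, r request) {
			defer wg.Done()
			results <- indexed{i: i, resp: handle(r)}
		}(i, r)
	}
	wg.Wait()
	close(results)

	// Only this goroutine writes to out, in request order.
	out := make([]response, len(reqs))
	for res := range results {
		out[res.i] = res.resp
	}
	return out
}

func main() {
	out := serveAll([]request{
		{ID: 1, Payload: "abc"},
		{ID: 2, Payload: "hello"},
	})
	fmt.Println(out[0].Size, out[1].Size) // prints "3 5"
}
```

The scheduler decides how many OS threads actually run these goroutines, which is the "runtime manages it for you" part.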

One area where Rust is still better is memory-constrained environments, e.g. on mobile and on microcontrollers, though there's tinygo and the Go runtime gets slimmer as well, so now you can have binaries and memory footprints smaller than 5 MB on most mobile platforms, which is absolutely acceptable even for budget phones. I think Tailscale e.g. runs their modified version of wireguard-go on all mobile clients without issues.

Caddy (a web server written in Go) is like two times slower than Nginx on many benchmarks.
caddy is written in terms of net/http, nginx is written right on top of epoll/kqueue with a bespoke HTTP/1.x parser in a manually memory managed language. I think the point was not that go is "faster" than anything, it's that it makes it easy to write hard-to-beat network software once you get out of the realm of toy or highly specialized problems.
Maybe a highly tuned nginx by a Russian neckbeard. Who ridiculed you for ever attempting to use nginx without first consulting with an archaic text.

Otherwise nginx is a piece of garbage web-server that likes to pretend it’s 1995. On top of that, it’s one of the most inhospitable toxic communities I’ve ever encountered.

I’d take caddy any day of the week over nginx.

Seems Rust places well in some composite benchmarks. Go is further down the list. Of course this depends on the quality of the implementation and doesn't account for UX/usability

https://www.techempower.com/benchmarks/#section=data-r21&tes...

According to that benchmark, Javascript seems to be the way to go.
Not familiar with Just, but it seems it's designed as a minimal wrapper over v8. The benchmark code is probably jumping directly from JS to native code... most of the GitHub repo is C++

Edit: There's actually an article that explains how Just ranks so highly https://just.billywhizz.io/blog/on-javascript-performance-01...

Slack has a system called Nebula that's pretty adjacent to userspace WireGuard.
Nebula is a Tailscale clone
Nebula predates tailscale.
Oh it does? I stand corrected.
link for the lazy: https://github.com/slackhq/nebula

Also with no ill intent looks like tailscale has the far more effective marketing organization :)

The team that built nebula at Slack actually split off and are building a similar type of network, that they give you a single pane of glass to manage, so you don’t have to manage a bunch of separate lighthouse devices manually.

I evaluated it when going through the many available options out there a year or two ago, and it was still pretty green feature wise. Still very cool and could work very well depending on your use cases. See: https://www.defined.net/

is the go userspace program actually shoveling this data or are they in-kernel buffer copies a la sendfile and the like
Yes, userspace has to touch the data to encrypt/decrypt it.
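
To illustrate why a sendfile-style in-kernel copy can't apply here: every payload byte has to pass through an AEAD in the process's own memory. The sketch below uses AES-256-GCM from the Go standard library purely as a stand-in, since WireGuard itself uses ChaCha20-Poly1305, which is not in the standard library. The all-zero key and fixed nonce are for demonstration only and must never be used for real traffic; `newAEAD` and `sealAndOpen` are illustrative names.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// newAEAD builds an AES-256-GCM AEAD (a stand-in for WireGuard's
// ChaCha20-Poly1305, chosen only because it ships with the stdlib).
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealAndOpen encrypts and then decrypts payload entirely in user space.
// It returns the decrypted bytes and the length of the sealed packet.
func sealAndOpen(payload []byte) ([]byte, int, error) {
	key := make([]byte, 32) // demo only: all-zero key
	aead, err := newAEAD(key)
	if err != nil {
		return nil, 0, err
	}
	nonce := make([]byte, aead.NonceSize()) // demo only: fixed nonce

	// Seal reads every plaintext byte and writes ciphertext plus a tag.
	sealed := aead.Seal(nil, nonce, payload, nil)

	// Open authenticates and decrypts, again touching every byte.
	opened, err := aead.Open(nil, nonce, sealed, nil)
	if err != nil {
		return nil, 0, err
	}
	return opened, len(sealed), nil
}

func main() {
	opened, sealedLen, err := sealAndOpen([]byte("hello"))
	if err != nil {
		panic(err)
	}
	// 5 payload bytes plus the 16-byte GCM tag.
	fmt.Println(string(opened), sealedLen) // prints "hello 21"
}
```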