Surpassing 10Gb/S over Tailscale (tailscale.com)
176 points by mssdvd 1164 days ago
16 comments

Pretty amazing that you can achieve such throughput in a Golang userspace program. I wonder if other UDP-based protocols like QUIC can attain those numbers as well.
Interestingly, the fastest CPU based network switches tend to do full kernel bypass. The kernel is generally slow compared to OVS and VPP, especially when they traverse over something like DPDK.
Kernel bypass in DPDK grants the application direct access to DMA buffers so that the kernel is no longer involved. This is not because the kernel is slow, but because many small syscalls are expensive and putting your entire app in the kernel is a bad idea.

There is no kernel bypass in wireguard-go, just a fast user-space implementation with smart use of syscalls to minimize the overhead of being split between user space and kernel space.

With io_uring, DPDK-style kernel bypass might stop making sense altogether.

It depends on what you are trying to do though. I don’t think the kernel has an easy path to operating on a set of packet headers as a vector at this point. Not saying it can’t happen, but it’s an area where user space is already ahead.

For reference, there was a previous test that demonstrated 40gbps with ipsec between two pods on separate nodes in k8s where the encap/decap achieved 40gbps which was the line rate for the Intel NICs used.

Details were published here: https://medium.com/fd-io-vpp/getting-to-40g-encrypted-contai...

I do agree that io_uring will negate the need for DPDK for many use cases, though; it will likely be a much simpler and more secure path than DPDK.

It's not that the kernel is slow; left to its own devices, the kernel is plenty fast. The reason is that when you want to make decisions about a packet in userspace (vs. telling the kernel what to do with it via various interfaces), that kernel logic is just overhead.

It's similar for applications: if you can, say, decode a whole DNS packet in one go, you don't really want the kernel to spend time decoding the UDP packet and then decode the rest of the packet yourself; doing it in one step is much faster.

There are some applications where the ability to vectorize the headers and operate on them with SIMD helps. These types of apps tend to pin a full core to do only packet processing, though. Also, syscalls are expensive. A lot of work is going into making the APIs async while avoiding syscalls.
Are there consumer (<$2k) network switches that can do Wireguard in a very fast path?
By their nature as L2/L3 devices, I wouldn't expect switches to ever support Wireguard. I also haven't heard of any hardware Wireguard yet. The fastest implementation so far might be TNSR which just squeaks in under $2,000.
Really depends on what you consider a “switch” to be. Most of Mikrotik’s CRS series supports full fat RouterOS, which includes wireguard support. Though the CPU on the CRS line is much cheaper than the proper routers (CCR series), so if you’re trying to do much more than a basic firewall and NAT on a residential connection (can probably handle 1Gbps fine on most of them) performance will not be great (even my CCR2004 can only handle ~3Gbps of IPSEC traffic).
Right, I misspoke. I should know better. I intended to say "a network device".
go is pretty fast

in fact, i have a standing bet with some of my rustacean friends that they can't show me a typical HTTP service in rust, which has performance numbers (rps, latency, throughput) that i can't meet or beat in go

of course lots of caveats there, what does normal-ish mean, well probably most of the work is gonna be i/o bound, it should run on normal server-class hardware, et cetera et cetera

but nothing yet

For most compiled languages, or languages with very good VMs like Java, benchmarks are really testing the quality of the implementation and the depth of the implementor's understanding.

I'd bet that very good Go and Rust programmers could probably converge to almost identical performance.

What I wouldn't bet on is that Go could equal Rust in the area of small memory footprint or on small devices.

> What I wouldn't bet on is that Go could equal Rust in the area of small memory footprint or on small devices.

I haven't found a microcontroller that's too small for tinygo. I have even used time.Format(time.RFC3339) on one before. $1 spent on a microcontroller is the ultimate luxury these days.

> I'd bet that very good Go and Rust programmers could probably converge to almost identical performance.

I'd imagine probably not, purely because Rust uses LLVM, which is VERY good at optimizing, while the Go compiler is simpler and made for speed of compilation first. If Go got an LLVM frontend, yeah, maybe.

> What I wouldn't bet on is that Go could equal Rust in the area of small memory footprint or on small devices.

Well, Go is GCed, which automatically makes it use at least a bit more memory, and every program also carries the GC code with it.

> If Go got an LLVM frontend, yeah, maybe

While Go probably won't get an official LLVM frontend, the TinyGo project [1] is trying to bring Go to embedded systems and it does use LLVM. Unfortunately I couldn't find any use for it in a project since it lacks so many features from mainline Go. Maybe I'll check back in a few years.

[1] https://tinygo.org/

unrelated but your username is solid gold
That's my experience as well. Recently I rewrote a Golang-based QUIC server in Rust and I had a hard time getting it to perform equally well. It's certainly possible, but it requires a lot of hand-tuning and knowing exactly what you're doing. In Golang you just spawn a goroutine for each request and avoid lock-based shared state as much as possible and you're mostly good; the runtime will manage all aspects like number of threads, allocations etc. for you.

One area where Rust is still better is memory-constrained environments, e.g. on mobile and on microcontrollers, though there's tinygo and the Go runtime is getting slimmer as well, so now you can have binaries and memory footprints smaller than 5 MB on most mobile platforms, which is absolutely acceptable even for budget phones. I think Tailscale, for example, runs their modified version of wireguard-go on all mobile clients without issues.

Caddy (a web server written in Go) is like two times slower than Nginx on many benchmarks.
caddy is written in terms of net/http, nginx is written right on top of epoll/kqueue with a bespoke HTTP/1.x parser in a manually memory-managed language. I think the point was not that go is "faster" than anything, it's that it makes it easy to write hard-to-beat network software once you get out of the realm of toy or highly specialized problems.
Maybe a highly tuned nginx by a Russian neckbeard. Who ridiculed you for ever attempting to use nginx without first consulting with an archaic text.

Otherwise nginx is a piece of garbage web-server that likes to pretend it’s 1995. On top of that, it’s one of the most inhospitable toxic communities I’ve ever encountered.

I’d take caddy any day of the week over nginx.

Seems Rust places well in some composite benchmarks. Go is further down the list. Of course this depends on the quality of the implementation and doesn't account for UX/usability

https://www.techempower.com/benchmarks/#section=data-r21&tes...

According to that benchmark, Javascript seems to be the way to go.
Not familiar with Just, but it seems it's designed as a minimal wrapper over v8. The benchmark code is probably jumping directly from JS to native code... most of the github repo is C++

Edit: There's actually an article that explains how Just ranks so highly https://just.billywhizz.io/blog/on-javascript-performance-01...

Slack has a system called Nebula that's pretty adjacent to userspace WireGuard.
Nebula is a Tailscale clone
Nebula predates tailscale.
Oh it does? I stand corrected.
link for the lazy: https://github.com/slackhq/nebula

Also with no ill intent looks like tailscale has the far more effective marketing organization :)

The team that built nebula at Slack actually split off and are building a similar type of network, that they give you a single pane of glass to manage, so you don’t have to manage a bunch of separate lighthouse devices manually.

I evaluated it when going through the many available options out there a year or two ago, and it was still pretty green feature wise. Still very cool and could work very well depending on your use cases. See: https://www.defined.net/

is the go userspace program actually shoveling this data or are they in-kernel buffer copies a la sendfile and the like
Yes, userspace has to touch the data to encrypt/decrypt it.
What's missing from all these figures is the resulting latency. It's often the case that vendors show impressive throughput numbers, but then the latency is terrible at that throughput.

Do you have those numbers as well?

We do look at them to check on how we're doing, and I want to dig into this area more over time. In particular, we don't do classful prioritization right now, and the typical tests for this are often focused on multi-flow classification. We also don't set specific congestion algorithms on our interfaces right now - availability is variable, as is their cost. You can see in the post that Jordan documents that the tests in the blog were all explicitly run over cubic.

We increased the sizes of the UDP buffers in the prior round of optimizations. The kernel defaults for UDP buffers are too small to approach the throughput discussed here - and the default sizings were the primary source of lots of dropped packets. I raised those to 7 MB, which seems like an odd number, but it's the largest you can set on macOS before the kernel rejects it - likely we'll eventually head for a per-platform split. At these speeds a 7 MB buffer represents up to 5ms of flow data, though this does not imply that it creates 5ms of bufferbloat - it just means that this increased buffer could itself account for 5ms in the worst non-lossy case. On the userspace side Tailscale also has some more buffer space now (we're reading and writing lists of packets at a time, not single packets), but the sizing there is more complex.
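As a rough check of the buffer arithmetic above, here is a minimal Python sketch. It assumes the 7 megabyte figure means 7 MiB and that the link runs at 10 Gb/s; the helper names `drain_time_ms` and `request_udp_buffers` are made up for illustration and are not Tailscale's code. Under those assumptions the buffer holds a little under 6 ms of traffic, which is in the same ballpark as the 5ms quoted.

```python
import socket


def drain_time_ms(buffer_bytes: int, link_bits_per_sec: float) -> float:
    """Time needed to drain a completely full buffer at the given link rate."""
    return buffer_bytes * 8 / link_bits_per_sec * 1000.0


def request_udp_buffers(sock: socket.socket, size: int) -> tuple:
    """Ask the kernel for larger socket buffers and report what it granted.

    The kernel may clamp the request (Linux) or reject it (macOS), so the
    values read back can differ from the values requested.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
    )


SEVEN_MIB = 7 * 1024 * 1024  # assumption: "7 MB" read as 7 MiB

if __name__ == "__main__":
    print(f"{drain_time_ms(SEVEN_MIB, 10e9):.2f} ms of data at 10 Gb/s")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        try:
            print("granted (rcv, snd):", request_udp_buffers(s, SEVEN_MIB))
        except OSError as exc:
            print("kernel rejected the request:", exc)
```

On Linux the granted size is limited by the `net.core.rmem_max` and `net.core.wmem_max` sysctls, so reading the value back is the only way to know what you actually got.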

This topic in general is much more complex - in the first throughput post I originally started to dig into it, and we cut that in editing because it was making the post too dense and there wasn't space to give the topic the attention it deserves. One day we'll talk about this too. Typically right now we add very little latency, low millis or lower - we actually add more jitter than latency, as any userspace program would. It's still orders of magnitude lower than the levels which even concern a typical realtime application such as gaming or communications - for example someone was recently talking about using Tailscale on their Steamdeck while on vacation to play Hogwarts streaming from their PC.

In the meantime, a real world example for you. I have a border router that I built using a relatively cheap piece of hardware (Intel(R) Celeron(R) J4105 CPU @ 1.50GHz). It has NICs that support GRO/GSO, but the CPU is the bottleneck for throughput. The box does 563MBits/sec inbound to the LAN over Tailscale (949 Mbits/sec raw). I run this as an exit-node for my workstation all the time, even though that's in the same building - and do so for the sake of diagnosing bugs and experiencing the product full time. In my initial test today, under peak load the exit node adds 35ms of latency each way. I was surprised by this, so I checked going direct rather than via the exit node, where I see 15ms down and 30ms up of latency increase under peak load. It seems Comcast dropped some capacity since I last tuned my uplink!

I then re-tuned CAKE on the router uplink to be more aggressive resulting in a raw bloat of 0ms/0ms, and then retested with the Tailscale exit node. With these more aggressive CAKE tunings, Tailscale also stayed at 0ms/0ms. This CAKE tuning ate a chunk of throughput capacity, as expected. The specific tuning here being for a Comcast 1000/40 link, and the system CPU bound at 500mbps for forwarding:

  + tc qdisc add dev internet root handle 1: cake docsis ack-filter-aggressive nat bandwidth 40mbit lan
  + ip link add name ifbinternet type ifb
  + tc qdisc add dev internet handle ffff: ingress
  + tc qdisc add dev ifbinternet root cake bandwidth 500mbit lan
  + ip link set ifbinternet up
  + tc filter add dev internet parent ffff: matchall action mirred egress redirect dev ifbinternet
On the LAN side, between the same machines (fq_codel only, default settings), running iperf3 alongside ping:

Under max load ([ 5] 0.00-57.73 sec 3.72 GBytes 554 Mbits/sec receiver):

  10 packets transmitted, 10 received, 0% packet loss, time 9013ms
  rtt min/avg/max/mdev = 2.625/3.620/4.536/0.646 ms
Zero load:

  10 packets transmitted, 10 received, 0% packet loss, time 9014ms
  rtt min/avg/max/mdev = 0.648/0.954/1.713/0.306 ms
What do these numbers mean? In practice they mean you'll notice WiFi more than you'll notice Tailscale, but we can and will still do better over time. Here's WiFi from a MacBook to the border router on the same LAN segment (no WireGuard/Tailscale):

  10 packets transmitted, 10 packets received, 0.0% packet loss
  round-trip min/avg/max/stddev = 3.845/11.363/34.152/8.940 ms
This is already long for an HN response, and so much more to say, but I hope it helps!
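The ping summaries in the comment above can be compared mechanically. This is a small Python sketch that parses the three summary lines as quoted and computes the average RTT added under load; the regular expression and the helper name `parse_rtt` are illustrative assumptions, and the only data used is what was posted.

```python
import re

# Matches the "= min/avg/max/dev ms" tail of both Linux and macOS ping summaries.
RTT_RE = re.compile(r"=\s*([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)\s*ms")


def parse_rtt(line: str) -> dict:
    """Extract min/avg/max/deviation (in ms) from a ping summary line."""
    match = RTT_RE.search(line)
    if match is None:
        raise ValueError(f"not a ping summary line: {line!r}")
    lo, avg, hi, dev = (float(x) for x in match.groups())
    return {"min": lo, "avg": avg, "max": hi, "dev": dev}


loaded = parse_rtt("rtt min/avg/max/mdev = 2.625/3.620/4.536/0.646 ms")
idle = parse_rtt("rtt min/avg/max/mdev = 0.648/0.954/1.713/0.306 ms")
wifi = parse_rtt("round-trip min/avg/max/stddev = 3.845/11.363/34.152/8.940 ms")

# Average RTT added by saturating the Tailscale path on the LAN.
added_under_load_ms = loaded["avg"] - idle["avg"]

if __name__ == "__main__":
    print(f"added under load: {added_under_load_ms:.3f} ms")
    print(f"WiFi average RTT: {wifi['avg']:.3f} ms")
```

With these samples the loaded path adds under 3 ms on average, while the WiFi hop alone averages over 11 ms, which is the point the comment makes. Ten pings per run is a small sample, so treat the comparison as indicative only.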
Very curious to learn more about CAKE tuning with tailscale, would love to see a post someday about how the two interact and when/why it might be needed?
I look forward to more, at a longer RTT.
Okay, as far as I understand this writeup.

There are two sides: the userspace UDP socket to receive wg packets on, and the tap file descriptor to receive unencrypted packets from the host OS.

To speed up the userspace UDP socket it's desirable to use UDP_GRO flag on RX, and UDP_SEGMENT flag on TX. `tx-udp-segmentation` is a HW help for the latter. No need for any checksums and stuff. This is just speedup for userspace "classic" UDP socket.
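For readers who want to poke at the two socket options named above, here is a minimal Linux-only Python sketch. The numeric values 103 and 104 are the `UDP_SEGMENT` and `UDP_GRO` constants from the kernel's `linux/udp.h`, spelled out because Python's `socket` module does not necessarily export them. The helper name and the 1200-byte segment size are arbitrary choices for illustration, not what wireguard-go uses.

```python
import socket

# Option numbers from linux/udp.h; assumed here, not taken from the article.
UDP_SEGMENT = 103  # TX: kernel (or NIC) splits one large send into segments
UDP_GRO = 104      # RX: kernel may coalesce several datagrams into one read


def enable_udp_offloads(sock: socket.socket, gso_size: int = 1200) -> dict:
    """Try to enable UDP GRO and GSO on a socket; report which ones stuck.

    Older kernels and non-Linux systems reject these options, so failures
    are recorded rather than raised.
    """
    results = {}
    for name, opt, value in (
        ("UDP_GRO", UDP_GRO, 1),
        ("UDP_SEGMENT", UDP_SEGMENT, gso_size),
    ):
        try:
            sock.setsockopt(socket.IPPROTO_UDP, opt, value)
            results[name] = True
        except OSError:
            results[name] = False
    return results


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        print(enable_udp_offloads(s))
```

With `UDP_SEGMENT` set on the socket, a single large send is cut into `gso_size`-byte datagrams below the syscall boundary, which is where the savings come from. Whether the NIC does the cutting depends on `tx-udp-segmentation` support.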

However, buffering with UDP_GRO is interesting, since you need to pass a potentially large 64KiB buffer to the kernel because you don't know how large the next GRO packet is. (This is a digression.)

On the tap side, the article implies they enabled TUN_F_TSO4, which is a magical offload flag on the tun interface. With it, it is possible to get large packets from the host OS. This is where it gets interesting. If you get a very large block from the host, like say 14KiB or larger.... how do you push it to the wireguard socket? I guess it's necessary to packetize it back to small-MSS packets before encrypting. That means recreating TCP headers (with seq numbers) and filling in the checksum. This sounds like "fun".
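To make the "packetize it back" step concrete, here is a Python sketch of the two primitives the comment mentions: cutting a large payload into MSS-sized pieces with advancing sequence numbers, and the 16-bit ones'-complement Internet checksum. It is a simplification under stated assumptions. A real TCP checksum also covers the pseudo-header and the TCP header, and nothing here is taken from wireguard-go; the function names are hypothetical.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def segment(payload: bytes, first_seq: int, mss: int) -> list:
    """Split one large TCP payload into (sequence number, chunk) pairs.

    Each segment's sequence number is the first one plus the byte offset,
    wrapped to 32 bits as TCP sequence numbers are.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [
        ((first_seq + off) & 0xFFFFFFFF, payload[off:off + mss])
        for off in range(0, len(payload), mss)
    ]


if __name__ == "__main__":
    big = bytes(14 * 1024)  # a 14 KiB block, as in the comment above
    segs = segment(big, first_seq=1000, mss=1400)
    print(len(segs), "segments, last is", len(segs[-1][1]), "bytes")
```

Each resulting chunk would still need its own TCP and IP header written out and its checksum recomputed before encryption, which is the tedious part the comment alludes to.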

The same on TX side towards the host... if you get a number of TCP segments from the wg tunnel, decrypt them.... do you push them as one large TUN_F_TSO segment to tun? or do you push one-by-one and rely on the kernel to GRO them? I didn't quite get it from the article. Or maybe it's possible to send large packets over wg without segmentation?

The same discussion applies to UDP. With UDP you can use TUN_F_USO; however, this is only available in kernel 6.2. This might be why there aren't too many UDP numbers in the article.

The missing feature from Tailscale for me is the ability to host a Tailscale only DNS zone.

They have Magic DNS, but that only works for individual Tailscale nodes. I want multiple DNS records pointing to a single Tailscale node. Would be even better if I could use my own domain (subdomain even better) instead of their long `foo-bar.ts.net` domain.

Currently need to do this manually, but seems overly redundant since Tailscale already does 90% of this with MagicDNS and is fast because it's in their client vs a remote server.

Step 1: install Tailscale and Docker on a VM or whatever

Step 2: set up a Technitium container in host networking mode

Step 3: configure Technitium with a stub zone pointing your ts.net name at 100.100.100.100

Step 4: set up a zone for whatever.tld

Step 5: set up a DNAME record for ts.whatever.tld pointing at your ts.net domain

Result: querying this new DNS server with machine.ts.whatever.tld resolves to machine.blah-foo.ts.net, which resolves to that machine's 100.64.0.0/10 address.
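The steps above lean on how a DNAME record rewrites names: any name strictly below the DNAME owner has the owner suffix replaced by the target. A small Python sketch of that substitution rule follows, using the placeholder names from the steps; `apply_dname` is a hypothetical helper and not part of Technitium or Tailscale.

```python
from typing import Optional


def apply_dname(qname: str, owner: str, target: str) -> Optional[str]:
    """Rewrite qname according to a DNAME record at `owner` pointing to `target`.

    Returns None when the record does not apply. A DNAME only covers names
    strictly below its owner, never the owner name itself.
    """
    q = qname.rstrip(".").lower().split(".")
    o = owner.rstrip(".").lower().split(".")
    t = target.rstrip(".").lower().split(".")
    if len(q) <= len(o) or q[-len(o):] != o:
        return None
    return ".".join(q[:-len(o)] + t)


if __name__ == "__main__":
    print(apply_dname("machine.ts.whatever.tld",
                      "ts.whatever.tld", "blah-foo.ts.net"))
```

The rewritten name is then resolved through the stub zone from step 3, which is what hands the query to 100.100.100.100.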

https://technitium.com/dns/

I know this can be done manually (and I do), but the issues with that are: 1. it's manual, and 2. the server it requires is a single point of failure.

My point was that MagicDNS is implemented in the Tailscale client on each machine (fault tolerant, 0ms latency) and has almost all the things necessary (DNS resolver, push mechanism for record updates) except for a custom defined zone.

Running `drill @100.100.100.100 <node_name>.<magic_dns_domain>.ts.net` is 0ms because it's local, and doesn't depend on a single DNS server running somewhere on my Tailscale network.

Yep, that's fair. I actually run this setup on every machine in my lab. Technitium is so light weight and with this setup I don't need to jump through any hoops to get Docker containers to resolve Tailscale names.
I'd never heard of Technitium, but was intrigued looking at it. Was thinking "hmmm what could I do with this" and then had to refrain from creating another project just because.

TBH I find Docker networking a struggle and usually disable the `iptables` stuff and end up configuring my own rules. Painful, but at least less intrusive.

On the note of Tailscale+Docker networking, gluetun[0] is pretty awesome. It runs a Wireguard (not tailscale compatible, yet) instance within a Docker container and then you share that networking namespace with the other containers effectively confining them to the VPN. Comes with basic container namespace firewall configuration and DNS over TLS configuration.

[0] https://github.com/qdm12/gluetun

There is an open GitHub issue for this, and it’s already been implemented in the Tailscale client. It’s really nice too, as the DNS records are pushed out to the local DNS resolver on each Tailscale client rather than being lookups to a separate server, so it’s super fast.

Unfortunately there aren’t any options for it on the Tailscale control panel, but if you use Headscale you can configure it and take advantage of it now.

I searched and couldn't find anything in the tailscale client repo. Link to the issue?

Did find headscale docs about "Setting custom DNS records"[0]. It seems only `A` and `AAAA` records are supported. This might be the start of setting up headscale this weekend.

[0] https://github.com/juanfont/headscale/blob/main/docs/dns-rec...

Tailscale is awesome, so damn recommended. Taildrop (AirDrop for everything, included in Tailscale) is especially recommended, it makes it so damn easy to send files between all your devices.
https://tailscale.com/kb/1106/taildrop/ seems to be the docs.

This is the first I've heard of this. I wonder if there's any big advantage to this for someone who is already using syncthing for the same purpose? The biggest thing I could hope for is that it's faster. But I generally don't keep Tailscale running on mobile because I don't need it and don't like the persistent notification.

Sync, continuous backup and transfer are all quite different use-cases.

Most backup/sync products are designed to work in the background and often require upload before download. I don’t know if syncthing does streaming syncs though.

Another difference is transfers can easily be untrusted, as in sender and receiver don’t need access to each other’s file systems. Take magic wormhole (or email attachments for that matter) as an example.

Taildrop is somewhere in between: I think you have to be on the same tailnet, but there's no need for awareness of the other device’s file system.

Syncthing is slower, you need to act on both devices.

With Taildrop you just need to share something with a couple of clicks, and it'll appear on the device(s) you share it to.

I’ve really wanted to try tailscale. I fear I’ll like it, and I don’t want another company to have a monopoly on simple things so everyone forgets how to do them.
Had similar feelings and did like it more than I thought I could.

My escape hatch from the monopoly is headscale[0] which I can self host.

[0] https://github.com/juanfont/headscale

You can even host Headscale over Tailscale, amusingly: https://tailscale.dev/blog/headscale-funnel
I don't think I'd classify their zero-config p2p-style VPN as "simple" -- or at least, certainly not simple to replicate...

More to the point, I hope their technology becomes commonplace & gratis a la LetsEncrypt for SSL Certificates.

I mean, setting up a WireGuard vpn is pretty darn simple, even into a k8s cluster. It’s not rocket science or anything; which is kinda my point. They make it too easy, and that worries me.
I've switched to tailscale because their nat busting is actually hard to do "by hand"
That’s sort of the problem, right? Joining two networks is pretty simple once you do it a few times. I remember when it was mandatory to know how to set up an email server (for more than one user), configure a secure FTP (+ WebDAV for a little while), and probably other things I’m totally forgetting about. These things were passed down from senior to junior like we pass down how to write Docker images and set up our IDE, while those very simple services of yesterday have been eaten up by monopolies. I’m not saying we shouldn’t have services to make our lives easier… I’m saying we should have more of them. I’m not interested in this space, but someone who is should see this company and go “damn, these guys have validated an idea for me. Maybe I can take some of their pie.” Instead, we just give them more money …

Look at email. It’s basically a “lost technology” in that it is nearly impossible to self-host (though there are people out there doing it, there are very few modern guides from zero to production). Same with file sharing and IRC servers.

Maybe I’m just rambling in my “old” age…

Tailscale has several competitors such as ZeroTier and Nebula. There does appear to be a winner-take-all dynamic where being slightly better lets Tailscale take 10x more mindshare than competitors, but I don't see any way around that.
none of those things are analogous to Tailscale having done loads of hard work to automate NAT busting.
Setting up a few p2p wg VPNs is manageable.

However, when you have 10 nodes and need to add one more node, you now need to update all other nodes so they can speak p2p. Management with scale is the struggle.
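The scaling problem described above is easy to put numbers on. This Python sketch assumes a hand-managed full mesh in which every node carries a peer entry for every other node; the function names are made up for illustration.

```python
def full_mesh_links(nodes: int) -> int:
    """Number of distinct peer-to-peer tunnels in a full mesh."""
    return nodes * (nodes - 1) // 2


def configs_touched_when_adding(existing_nodes: int) -> int:
    """Config files edited to add one node: every existing peer plus the new one."""
    return existing_nodes + 1


if __name__ == "__main__":
    for n in (3, 10, 11, 50):
        print(f"{n} nodes -> {full_mesh_links(n)} tunnels")
    print("adding node 11 touches",
          configs_touched_when_adding(10), "configs")
```

Going from 10 to 11 nodes adds 10 tunnels and means editing 11 configs by hand, which is the management burden the comment is pointing at.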

If you have 10 nodes, you should already be automating with ansible/chef/puppet/whatever, at which point adding another link config is easy.
For servers sure, but things like `tailscale` exist to save every laptop and cell phone from looking like a devops project.

Furthermore, you could extend this argument to almost every other cloud service with a primary feature of "convenience" and/or "management". Just build everything yourself.

Tailscale is amazing. I was able to set up our AWS VPN with it in <30 minutes, and it's just worked ever since. Getting new users set up is similarly seamless.

If this means I continue to forget how to run OpenVPN I consider that well worth it.

It's made putting internal apps in a private subnet on a VPC a very trivial process. Like took me an afternoon and works well for my small 40 person company.
Userspace networking makes me a bit sad in that it's much harder for users to observe or instrument. It's convenient for app developers, but to lock users out of seeing what is happening on their own systems feels awful.
Why does the in-kernel WireGuard perform so much worse on the AWS instances?
Nice improvements! I'd be interested to see how much overhead Tailscale's magicsock adds and what a flamegraph after the change looks like. Mostly crypto or still a lot of networking syscall time?
magicsock definitely does a bunch more work, and we do look at both profiles. The magicsock profile is harder to read as a consequence of being a more complex path, adding packet filters, the indirection for DERP and other NAT busting details, etc. Jordan did do some optimizations in the magicsock path alongside this wireguard-go work to get us over the 10gbps line.

Overall the summary of time spent is still a similar story at the coarse scale - our recent optimizations mean that we're getting ever closer to the point where we need to start working on the next layer, such as optimizing the queues (visible here in the chanrecv and scheduler times - Go runtime stuff), and once we get that out of the way things like crypto and copying will become targets. The work goes on, we have lots of plans and ideas!

Super neat.

Have these optimizations (TCP GRO/GSO) been applied to non-root tailscale? I imagine, the changes needed are wildly different as the TUN device itself is gvisor/netstack. I believe, the UDP GRO/GSO part (discussed in today's blog post) may work as-is.

Good question, it's bits and pieces. I know there's more we can do with the userspace stack - netstack has some support for GRO/GSO, but unless I'm forgetting a detail we haven't fully plumbed that yet. It would definitely be interesting to do so - avoiding TUN turnaround while still utilizing mmsg and so on should provide excellent performance for something like a tsnet/libtailscale based server. We did recently improve performance in that configuration by enabling SACK, which is very significant.
I have no idea what a gigabit per siemens is supposed to mean.
The blog post's actual title doesn't use that case.
Half-way through the article it just says UDP receive coalescing, once, and never mentions it again. Do they mean interrupt mitigation? If so, using what parameters?
I guess UDP receive coalescing is UDP GRO (generic receive offload) + recvmmsg(2)
Yup, this is referring to GRO.

IIRC we use a contiguous 64kb buffer in the first scatter-gather slot and 128 messages per syscall in the current tuning.

Author here. There was no interrupt tuning performed on the devices under test. UDP receive coalescing was enabled via the UDP_GRO sockopt.
Surely the problem with GSO is that you're now bursting UDP over the wire as fast as possible and that will be problematic for downstream switches?
Love to hear optimization stories, great work!
I see a new blog post from Tailscale, I upvote.