by 5e92cb50239222b 1174 days ago
Since tailscaled uses the tun/tap driver and thus copies all traffic to userspace (and back), it is extremely inefficient. On my Haswell i5 (plus multiple servers with comparable hardware) the process consumes 40% of CPU time at just 4 MiB/s, and close to 100% at 10-11 MiB/s (with recent sendmmsg/recvmmsg patches¹).
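For a sense of scale, the figures above can be turned into a rough per-packet CPU cost. This is only a sketch under two assumptions that are not stated in the comment: every packet is a full 1280 bytes (the Tailscale MTU mentioned elsewhere in the thread), and "40% of CPU time" means 40% of a single core.

```python
# Back-of-the-envelope: per-packet CPU cost implied by the reported figures.
# Assumptions (not from the comment): full 1280-byte packets, and the CPU
# percentages refer to one core.

MIB = 1024 * 1024


def packets_per_second(throughput_mib_s: float, packet_bytes: int = 1280) -> float:
    """Packets per second needed to carry the given throughput."""
    return throughput_mib_s * MIB / packet_bytes


def cpu_us_per_packet(cpu_fraction: float, throughput_mib_s: float,
                      packet_bytes: int = 1280) -> float:
    """Microseconds of CPU time spent per packet at the given load."""
    return cpu_fraction / packets_per_second(throughput_mib_s, packet_bytes) * 1e6


if __name__ == "__main__":
    # 40% CPU at 4 MiB/s, and roughly 100% at 10-11 MiB/s (10.5 used as a midpoint).
    for cpu, rate in [(0.40, 4.0), (1.00, 10.5)]:
        print(f"{rate} MiB/s: {packets_per_second(rate):.0f} packets/s, "
              f"{cpu_us_per_packet(cpu, rate):.0f} us CPU per packet")
```

Under those assumptions both data points land in the same ballpark per packet, which would be consistent with a fixed per-packet cost rather than a per-byte one.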

This is roughly 2-3x worse than similar applications written in highly optimized C, so don't expect any miracles from further optimizations unless they switch to kernel WireGuard (which doesn't seem likely in the near future).

They claim it's very difficult if not impossible, but this sounds like an issue with their architecture — a similar application from their competitors² has had kernel WireGuard support from the start (no relation, I don't even use it and cannot recommend for or against it).

1: https://tailscale.com/blog/throughput-improvements

2: https://github.com/netbirdio/netbird

3 comments

Tailscalar here. For what it's worth, I run my Plex server on Tailscale (i5 10600) and I haven't noticed any observable lag due to the TUN/TAP driver, even with 4K Blu-ray rips at several tens of megabits per second. I also regularly get near the limit of gigabit Ethernet when transferring big files like machine learning models (the 1280 byte MTU plus WireGuard overhead adds up over time and can make the application-observed rate lower than what the NIC is actually doing).

Kernel WireGuard for Tailscale is hard because of DERP (HTTPS/TCP fallback relay, all connections start over DERP so that they can Just Work if hole punching fails), but I'm sure it could happen with the right combination of eBPF and Rust in the kernel. It'd be a bit easier if there was a high level abstraction for using the kernel TLS stack to do outgoing TLS connections.

Isn’t it also a UDP issue in general, or at least the way packet switching works in Golang on major OSs? I did a bandwidth benchmark over a local network, Tailscale vs. vanilla (in the 100 MB/s ballpark), and Tailscale was 10-20% slower and used tons of CPU.

As a baseline I tried pushing blank UDP packets with Golang (on Darwin and Linux) at saturated capacity and it ALSO used similar excess CPU, causing dropped packets. My take at the time was that it was primarily the syscall overhead per packet (vs per arbitrarily sized buffer in TCP), and a lack of efficient OS APIs in Golang. Is there truth to this analysis?
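To put rough numbers on the syscall-overhead argument, here is a small estimate of how many send syscalls per second each approach needs at about 100 MB/s. The 1280-byte datagram size, the batch size of 64, and the 64 KiB TCP write size are illustrative assumptions, not measurements from the benchmark described above.

```python
# Rough estimate of send-side syscall rates at ~100 MB/s.
# All sizes below are assumptions chosen for illustration.

def syscalls_per_second(throughput_bytes_s: float, bytes_per_call: int) -> float:
    """Syscalls per second if each call moves bytes_per_call bytes."""
    return throughput_bytes_s / bytes_per_call


RATE = 100 * 1000 * 1000      # 100 MB/s, as in the benchmark ballpark
DATAGRAM = 1280               # assumed UDP payload size per datagram
BATCH = 64                    # hypothetical datagrams per batched call (sendmmsg-style)
TCP_WRITE = 64 * 1024         # hypothetical buffer size per TCP write

per_packet = syscalls_per_second(RATE, DATAGRAM)          # one syscall per datagram
batched = syscalls_per_second(RATE, DATAGRAM * BATCH)     # one syscall per batch
tcp = syscalls_per_second(RATE, TCP_WRITE)                # one syscall per buffer

if __name__ == "__main__":
    print(f"UDP, one syscall per datagram: {per_packet:,.0f}/s")
    print(f"UDP, batched x{BATCH}:           {batched:,.0f}/s")
    print(f"TCP, {TCP_WRITE // 1024} KiB writes:           {tcp:,.0f}/s")
```

With these assumed sizes, unbatched UDP needs on the order of fifty times more syscalls than either batched UDP or large TCP writes, which is the gap that sendmmsg/recvmmsg-style batching is meant to close.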

Hi! Tailscaler here, one of the folks who worked on the recent throughput improvements. One of the machines I was testing with during our work on segment offloading was a Haswell. I absolutely understand your concern: if we're using 40% of CPU at 4 MiB/s, we should be doing substantially better than that on efficiency. In our various testbeds, which include CPUs like yours, we see higher performance. If you'd like us to look into the issue, do email support@tailscale.com. We'd be really happy to dig in and find the cause.

We have continued our work on performance improvements. As an example, we recently diagnosed a regression in the kernel's frequency scaling governor that Tailscale can tickle, and we have an ongoing discussion with the kernel maintainers about it. I'm not assuming this particular thing is the key source of the performance you're observing; it's more an anecdote to show that we're still digging deep into areas where we aren't performing well, finding the root causes, and working both inside and outside to address them, adding workarounds where appropriate.

I observe about 37% overhead when using a TS connection on a local gigabit network.

Copying a large file from a Synology DS1821+ NAS (AMD Ryzen V1500B) to a Windows PC (i7-6700K) runs at about 111-113 MB/s when accessing the NAS directly and 70-73 MB/s when traffic goes through TS (different large files each time, so no caching here).
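The "about 37%" figure can be checked against the quoted transfer rates. A quick sketch, using only the numbers given above:

```python
# Throughput drop implied by the quoted transfer rates (MB/s).

def throughput_drop(direct_mb_s: float, tunneled_mb_s: float) -> float:
    """Fraction of direct throughput lost when going through the tunnel."""
    return 1 - tunneled_mb_s / direct_mb_s


DIRECT = (111, 113)     # direct NAS access, low and high
TUNNELED = (70, 73)     # through Tailscale, low and high

# Pair the extremes to get the smallest and largest drop the ranges allow.
best_case = throughput_drop(DIRECT[0], TUNNELED[1])    # slow direct, fast tunnel
worst_case = throughput_drop(DIRECT[1], TUNNELED[0])   # fast direct, slow tunnel

if __name__ == "__main__":
    print(f"drop between {best_case:.1%} and {worst_case:.1%}")
```

So the quoted ranges correspond to a drop of somewhere in the mid-30s percent, in line with the stated figure.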

My back of the napkin math says there should be a 40 byte overhead for WireGuard around Tailscale's 1280 byte packets. That's only about a 3% overhead on the direct wire. What is your testing methodology so I can attempt to replicate it in the lab?

I meant overhead in a broad sense, both packet size and CPU load combined: what end users actually care about.
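The two notions of overhead in this exchange can be put side by side. The sketch below takes the 40-byte figure from the comment as-is (the real per-packet encapsulation cost may differ) and assumes a direct rate of 112 MB/s, the midpoint of the 111-113 MB/s quoted earlier.

```python
# Header overhead vs. measured throughput, using figures from the thread.
# ENCAP_BYTES is the commenter's estimate, not a verified protocol constant.

ENCAP_BYTES = 40        # per-packet overhead quoted in the comment
INNER_BYTES = 1280      # Tailscale packet size quoted in the comment


def header_overhead(encap_bytes: int, inner_bytes: int) -> float:
    """Extra bytes on the wire relative to the inner packet size."""
    return encap_bytes / inner_bytes


def header_limited_rate(direct_mb_s: float, encap_bytes: int, inner_bytes: int) -> float:
    """Payload rate if header overhead were the only cost of tunneling."""
    return direct_mb_s * inner_bytes / (inner_bytes + encap_bytes)


if __name__ == "__main__":
    direct = 112.0  # MB/s, assumed midpoint of the direct transfer rate
    print(f"header overhead: {header_overhead(ENCAP_BYTES, INNER_BYTES):.1%}")
    print(f"header-limited rate: "
          f"{header_limited_rate(direct, ENCAP_BYTES, INNER_BYTES):.1f} MB/s")
```

If headers were the whole story, the tunneled copy would be expected to run only a few MB/s below the direct one, so most of the gap down to 70-73 MB/s would have to come from somewhere else, such as CPU load.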

My test is what I have to do fairly often: use Windows Explorer to copy a 70-100 GB file from a network NAS to a local drive. Every so often I click on the wrong network share pinned in Explorer and see the slow transfer speed.