by tptacek 1045 days ago
UDP is connectionless precisely so you can build novel stateful protocols on it. There’s no promise in UDP that you’ll be able to statelessly monitor it.
2 comments

UDP is actually more expensive to NAT than TCP is. The reason is UDP fragmentation, which is my vote for the worst, and least forgivable, design error of TCP/IP.

Instead of putting the fragmentation in L4 (like QUIC now does) and including a UDP header on every fragmented packet in a datagram, UDP only includes the header on the first packet. When fragmentation happens, firewalls, NATs, and end hosts have to buffer and coalesce IP packets based on IP IDs before the destination can be identified. It's a real nuisance. A lot of "stateless" CGNAT implementations can't handle this, and you get very hard-to-debug issues when there are fragmentation and MTU mismatches.
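
To make the problem concrete, here is a minimal Python sketch of IPv4-style fragmentation applied to a UDP datagram. The helper names (`udp_datagram`, `ip_fragment`, `ports_if_visible`) are made up for illustration, the IP header itself is not built, and the UDP checksum is left at zero. The point is only that the ports live in the first fragment and nowhere else.

```python
import struct

def udp_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Build a UDP header plus payload (checksum left as 0 for brevity)."""
    length = 8 + len(payload)
    return struct.pack("!HHHH", src_port, dst_port, length, 0) + payload

def ip_fragment(l4_bytes: bytes, mtu: int, ip_header_len: int = 20):
    """Split an IP payload into (byte_offset, more_fragments, chunk) tuples.

    Every fragment except the last carries a multiple of 8 bytes, because
    the IP fragment offset field counts in 8-byte units.
    """
    max_chunk = (mtu - ip_header_len) // 8 * 8
    fragments = []
    for offset in range(0, len(l4_bytes), max_chunk):
        chunk = l4_bytes[offset:offset + max_chunk]
        more = offset + max_chunk < len(l4_bytes)
        fragments.append((offset, more, chunk))
    return fragments

def ports_if_visible(offset: int, chunk: bytes):
    """Return (src_port, dst_port) only if this fragment holds the UDP header."""
    if offset != 0 or len(chunk) < 4:
        return None
    return struct.unpack("!HH", chunk[:4])

datagram = udp_datagram(53, 40000, b"x" * 3000)
fragments = ip_fragment(datagram, mtu=1500)
for offset, more, chunk in fragments:
    print(offset, more, len(chunk), ports_if_visible(offset, chunk))
```

A middlebox that sees the second or third fragment first has nothing but the IP ID to go on, so it has to hold the fragment until the one at offset 0 shows up.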

This is probably more accurately called IP fragmentation (since that is the layer where the fragmentation happens), and a lot of companies make it optional to support in networking gear. I'm surprised that you are using it or seeing it, because it is essentially obsolete today.

It has a legitimate purpose in old-timey systems which have bespoke MTUs on each link, but now the usual thing is to use 1500 bytes for WAN traffic, which is the generic Ethernet MTU, and reserve larger sizes for intra-datacenter communications.

There are a number of UDP protocols that have large enough payloads to fragment. DNSSEC and EDNS0 in particular made it much more common, though the EDNS0 flag day in 2020 partially undid some of the damage by getting folks to ratchet down their EDNS0 buffer sizes.

1500 is absolutely not a pervasively usable WAN MTU; you're going to need PMTUD if you're sending 1500-byte packets broadly. Plenty of WAN links won't tolerate it. If you don't want to deal with fragmentation at all ... the only hard guarantees are small (every IPv4 host must accept a 576-byte datagram, and every IPv6 link must carry 1280 bytes), but in practice it's exceptionally rare to see anything below about 1200 require fragmentation. But you can only ever control what you send, not what others are sending you.

One thing I've learned since joining Fly.io in 2020 is to laugh when people point to the 1500 MTU. You absolutely can't count on that: IPv6 cuts into it, and so does every additional layer of encapsulation on your path.
Yeah, you have to account for the headers in the 1500-byte MTU, which I suppose can be substantial if you have several VLAN tags, IPsec, IPv6, and a bunch of IP options. Presumably most of that encapsulation happens inside a datacenter, though, where you can use jumbo frames.
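
For a rough sense of how quickly headers eat into the MTU, here is a small Python sketch of the arithmetic. The header sizes are the standard minimums (no IP options, no extension headers, no TCP options); the VXLAN-over-IPv4 tunnel is just an assumed example of one encapsulation layer, and `remaining` is a made-up helper name.

```python
# Minimum header sizes in bytes; real packets may carry options on top.
HEADER_BYTES = {
    "ipv4": 20,      # without options
    "ipv6": 40,      # fixed header, no extension headers
    "udp": 8,
    "tcp": 20,       # without options
    "vxlan": 8,
    "ethernet": 14,  # inner frame carried inside a tunnel, no VLAN tag
}

def remaining(mtu: int, layers: list[str]) -> int:
    """Bytes left for the next layer after subtracting each header in `layers`."""
    return mtu - sum(HEADER_BYTES[name] for name in layers)

# UDP payload that fits in one packet on a plain 1500-byte path.
plain_v4 = remaining(1500, ["ipv4", "udp"])   # 1472
plain_v6 = remaining(1500, ["ipv6", "udp"])   # 1452

# UDP payload that fits within the IPv6 minimum link MTU of 1280.
floor_v6 = remaining(1280, ["ipv6", "udp"])   # 1232

# One layer of VXLAN over IPv4 leaves a 1450-byte MTU for the inner network,
# and IPv6 + UDP inside that tunnel leaves 1402 bytes of payload.
inner_mtu = remaining(1500, ["ipv4", "udp", "vxlan", "ethernet"])  # 1450
tunnelled = remaining(inner_mtu, ["ipv6", "udp"])                  # 1402
```

The 1232 figure is the same one the 2020 DNS flag day recommended as an EDNS0 buffer size.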
With IPv6 only the endpoint can fragment, not any hop in between.
Even well-behaved unfragmented UDP should be more expensive to NAT because it doesn't have an end-of-stream "FIN" marker, meaning stateful middleboxes need to retain state for longer because they can only time out.
Timeouts on UDP are usually much shorter than TCP, so it's not as bad as it sounds.
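
A toy Python model of the difference, for what it's worth. The timeout values are assumptions picked for illustration, not taken from any particular NAT, and real TCP tracking follows the full FIN/ACK exchange rather than dropping state on the first FIN. `NatTable` and its methods are hypothetical names.

```python
UDP_IDLE_TIMEOUT = 30.0     # seconds; assumed value for illustration
TCP_IDLE_TIMEOUT = 7200.0   # seconds; assumed value for illustration

class NatTable:
    """Tracks flows by (protocol, flow key) and the time they were last seen."""

    def __init__(self):
        self.entries = {}

    def packet(self, proto, flow, now, fin_or_rst=False):
        key = (proto, flow)
        if proto == "tcp" and fin_or_rst:
            # TCP announces the end of the stream, so state can go right away
            # (simplified: a real NAT waits for the close to complete).
            self.entries.pop(key, None)
            return
        self.entries[key] = now

    def expire(self, now):
        # UDP has no end-of-stream marker, so idle timeout is the only exit.
        for key, last_seen in list(self.entries.items()):
            timeout = UDP_IDLE_TIMEOUT if key[0] == "udp" else TCP_IDLE_TIMEOUT
            if now - last_seen > timeout:
                del self.entries[key]

nat = NatTable()
nat.packet("tcp", ("10.0.0.2", 50000, "192.0.2.1", 443), now=0.0)
nat.packet("udp", ("10.0.0.2", 50001, "192.0.2.1", 53), now=0.0)
nat.packet("tcp", ("10.0.0.2", 50000, "192.0.2.1", 443), now=1.0, fin_or_rst=True)
nat.expire(now=10.0)   # the UDP entry is still held here
```

The TCP entry disappears as soon as the close is seen, while the UDP entry sits in the table until the idle timer runs out, however short that timer is.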
But TCP fragments in the same way?
TCP does not use IP fragmentation, and the IP packets are marked "Don't fragment". TCP performs its own segmentation, and every packet gets a TCP header in its leading section. A NAT, firewall, or end host can L4-route the TCP packet as-is and does not need to correlate it with other packets.

Edited to extend: this is why TCP has a "Maximum Segment Size", and why Path MTU Discovery information has to be passed into the TCP state machine. It is TCP that takes responsibility for carving up the data into the packets, not IP.
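
Roughly, in Python, assuming 20-byte IP and TCP headers with no options. `mss_for` and `segment` are names invented for this sketch, and each segment is represented as a tuple rather than real wire bytes.

```python
def mss_for(path_mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """Largest TCP payload that fits in one unfragmented packet on this path."""
    return path_mtu - ip_header - tcp_header

def segment(stream: bytes, mss: int, src_port: int, dst_port: int, initial_seq: int = 0):
    """Carve a byte stream into (src_port, dst_port, seq, payload) segments.

    Every segment carries its own ports and sequence number, so each packet
    can be routed or NATed without looking at any other packet.
    """
    segments = []
    for offset in range(0, len(stream), mss):
        seq = (initial_seq + offset) % 2**32  # sequence numbers wrap at 32 bits
        segments.append((src_port, dst_port, seq, stream[offset:offset + mss]))
    return segments

mss = mss_for(1500)  # 1460
segments = segment(b"y" * 3000, mss, src_port=50000, dst_port=443, initial_seq=1000)
```

If PMTUD reports a smaller path MTU, the only thing that changes is the `mss` value handed to the segmenter; the IP layer never has to split anything.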

One of the goals of UDP was to avoid needing this kind of state, which is why the IP layer handles fragmentation for it instead. This is allowed on a hop-by-hop basis, unless the DF bit is set; so when a "too big" packet gets to a node with a smaller MTU, it can just split it and send on the fragments. No PMTUD needed.

The design could have been for the fragmenting node to also add a UDP header as part of that process, but was not. It would have been a simple change at the time. It's had a lot of consequences since and is responsible for a decent amount of complexity in hardware and software packet pipelines.

It could not have copied the UDP header. Otherwise you wouldn't be able to put any new protocol on IP without teaching it to every router.
Several other protocols solve this in a layering agnostic way by simply having a header length field. The header bytes can then be copied without any understanding of the format. This is even how IP's own ICMP protocol knows how much of an IP packet it should (at least) include in an error message so that the sender can know what triggered the error.
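
As a sketch of what "copy by length field" looks like, here is the ICMP-style quoting in Python: read the IHL nibble from the IPv4 header, then take the header plus the first 8 bytes of whatever follows, without parsing it. `ipv4_packet` and `icmp_quote` are hypothetical helpers, the addresses are documentation addresses, and the header checksum is left at zero.

```python
import struct

def ipv4_packet(payload: bytes, proto: int, options: bytes = b"") -> bytes:
    """Build a minimal IPv4 packet (checksum left as 0; options padded by caller)."""
    assert len(options) % 4 == 0
    ihl = 5 + len(options) // 4            # header length in 32-bit words
    total_len = ihl * 4 + len(payload)
    header = struct.pack(
        "!BBHHHBBH4s4s",
        (4 << 4) | ihl, 0, total_len,
        0x1234, 0,                          # identification, flags/fragment offset
        64, proto, 0,                       # TTL, protocol, checksum
        bytes([192, 0, 2, 1]), bytes([198, 51, 100, 7]),
    )
    return header + options + payload

def icmp_quote(packet: bytes) -> bytes:
    """Return the IP header plus the first 8 bytes after it.

    Only the header-length field is consulted; the L4 protocol is never parsed.
    """
    header_len = (packet[0] & 0x0F) * 4
    return packet[:header_len + 8]

udp = struct.pack("!HHHH", 53, 40000, 8 + 100, 0) + b"z" * 100
quote_plain = icmp_quote(ipv4_packet(udp, proto=17))
quote_opts = icmp_quote(ipv4_packet(udp, proto=17, options=b"\x00" * 4))
```

For UDP those 8 quoted bytes happen to be the whole UDP header, which is how the sender can match an error to a socket.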

TCP, UDP, ICMP and IP were all designed contemporaneously; UDP fragmentation could just as easily have been specified. It's just an odd, regrettable quirk.

Also, if you get UDP completely right, do you need any other IP protocols? The whole point of UDP is programming directly to the datagram interface. Before IPv6 you could even disable the checksums.
Well about those new IP protocols...
It's been a while since I've thought about this; thanks for the refresher.
MSS was also super annoying for me doing re-encapsulation of TCP packets! We wanted to do eBPF cut-through routing of TCP connections for WebRTC stuff, where proxy bounces would be problematic because connections need to live a long time. If you're shuttling packets around, you're going to eat into the MTU with your own headers. 99.9% of our TCP connections weren't cut through, so we didn't want to dial new settings into VMs for that feature; we did it in eBPF instead, and parsing/adjusting TCP headers in BPF C (pre-bounded loops!) wasn't fun.
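
For readers who haven't had the pleasure, this is roughly what MSS clamping involves, sketched in Python rather than BPF C. It is not the code described above, just an illustration of the option walk: `clamp_mss` is a made-up name, it operates on a bare TCP header, and it deliberately skips the TCP checksum update that a real packet path would also need.

```python
import struct

TCP_OPT_EOL, TCP_OPT_NOP, TCP_OPT_MSS = 0, 1, 2

def clamp_mss(tcp_header: bytes, max_mss: int) -> bytes:
    """Lower the MSS option in a TCP header to at most `max_mss`.

    NOTE: the TCP checksum is not recomputed here; a real implementation
    has to fix it up after changing the option value.
    """
    buf = bytearray(tcp_header)
    header_len = min((buf[12] >> 4) * 4, len(buf))  # data offset, in bytes
    i = 20                                          # options start after fixed header
    while i < header_len:
        kind = buf[i]
        if kind == TCP_OPT_EOL:
            break
        if kind == TCP_OPT_NOP:                     # single-byte option
            i += 1
            continue
        if i + 1 >= header_len:
            break
        length = buf[i + 1]
        if length < 2 or i + length > header_len:   # malformed option, stop
            break
        if kind == TCP_OPT_MSS and length == 4:
            (mss,) = struct.unpack_from("!H", buf, i + 2)
            if mss > max_mss:
                struct.pack_into("!H", buf, i + 2, max_mss)
            break
        i += length
    return bytes(buf)

# A SYN header with options: MSS=1460, NOP, window scale=7 (data offset = 7 words).
options = struct.pack("!BBH", 2, 4, 1460) + bytes([1]) + bytes([3, 3, 7])
syn = struct.pack("!HHIIBBHHH", 50000, 443, 1000, 0, 7 << 4, 0x02, 65535, 0, 0) + options
clamped = clamp_mss(syn, 1400)
```

The variable-length option walk is the part that is awkward without bounded loops, since the MSS option can sit anywhere in the options area.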
Which is why game networking libraries put a lot of emphasis on NAT traversal, forcing NATs to recognise the "connection". And why game console manufacturers tell users to just forward all incoming traffic unmanaged by the NAT to the console.