Hacker News new | ask | show | jobs
by tedunangst 1047 days ago
But TCP fragments in the same way?
1 comments

TCP does not use IP fragmentation, and the IP packets are marked "Don't fragment". TCP performs its own fragmentation and every packet gets a TCP header in its leading section. A NAT, Firewall, or end-host can L4 route the TCP packet as-is and does not need to correlate with other packets.

Edited to extend: this is why TCP has a "Maximum Segment Size", and why Path MTU Discovery information has to be passed into the TCP state machine. It is TCP that takes responsibility for carving up the data into the packets, not IP.

One of the goals of UDP was to avoid needing this kind of state, which is why the IP layer handles fragmentation for it instead. This is allowed on a hop-by-hop basis, unless the DF bit is set; so when a "too big" packet gets to a node with a smaller MTU, it can just split it and send on the fragments. No PMTUD needed.

The design could have been for the fragmenting node to also add a UDP header as part of that process, but was not. It would have been a simple change at the time. It's had a lot of consequences since and is responsible for a decent amount of complexity in hardware and software packet pipelines.

It could not have copied the UDP header. Otherwise you wouldn't be able to put any new protocol on IP without teaching it to every router.
Several other protocols solve this in a layering agnostic way by simply having a header length field. The header bytes can then be copied without any understanding of the format. This is even how IP's own ICMP protocol knows how much of an IP packet it should (at least) include in an error message so that the sender can know what triggered the error.

TCP, UDP, ICMP and IP were all designed contemporaneously; UDP fragmentation could also easily have just been specified for. It's just an odd regrettable quirk.

Also, if you get UDP completely right, do you need any other IP protocols? The whole point of UDP is programming directly to the datagram interface. Before IPv6 you could even disable the checksums.
Well about those new IP protocols...
It's been a while since I've thought about this; thanks for the refresher.
MSS was also super annoying for me doing re-encapsulation of TCP packets! We wanted to do eBPF cut-through routing of TCP connections for WebRTC stuff, where proxy bounces would be problematic because connections need to live a long time. If you're shuttling packets around, you're going to eat into the MTU with your own headers. 99.9% of our TCP connections weren't cut through so we don't want to dial in new settings into VMs for that feature, so we did it in eBPF, and parsing/adjusting TCP headers in BPF C (pre-bounded loops!) wasn't fun.