by mitchs 2220 days ago
Something worth knowing: TCP checksums should not be relied upon. If you aren't using TLS, your application needs to do its own checksums. A 16-bit ones' complement sum over a packet is not sufficient, especially given modern switch ASIC design. These chips take really fast signals off the fiber and turn them into slower-moving, parallel signals within the ASIC. Often these signal buses are a nice even number of bytes wide, 204 in this example. When something is flawed in the slower-moving internal pipeline, it will hit the same bit position every 204 bytes. If it is borked enough to flip two bits within the packet, those flips will be in the same bit position a multiple of 204 bytes apart. Since 204 is a multiple of 2, that is also the same bit position within the 16-bit words the checksum adds up. If one bit flips up and the other flips down, the sum is unchanged and the packet passes! In my case it ended up corrupting data in a BGP session's TCP stream, executing the world's most confusing route hijack in our network.
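To make the failure mode concrete, here is a minimal sketch of it in Python. It assumes an RFC 1071 style 16-bit ones' complement sum over the payload only (the real TCP checksum also covers the header and a pseudo-header), and the payload contents, the byte offset 100, and the bit mask 0x20 are made up for illustration. CRC-32 from `zlib` stands in as one possible application-level check, not a recommendation of a specific one.

```python
import zlib

BUS_WIDTH = 204  # bytes; the bus width quoted in the comment above


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical payload; any bytes work as long as the two target bits differ.
original = bytearray((i * 7 + 3) % 256 for i in range(1024))
first, second, mask = 100, 100 + BUS_WIDTH, 0x20
original[first] &= ~mask & 0xFF  # force the target bit to 0 here
original[second] |= mask         # force the target bit to 1 here

# Simulate the faulty pipeline: same bit position, one bus width apart.
corrupted = bytearray(original)
corrupted[first] ^= mask   # flips 0 -> 1
corrupted[second] ^= mask  # flips 1 -> 0

same_sum = internet_checksum(bytes(original)) == internet_checksum(bytes(corrupted))
same_crc = zlib.crc32(bytes(original)) == zlib.crc32(bytes(corrupted))
print("payload changed:", original != corrupted)
print("ones' complement checksum still matches:", same_sum)
print("CRC-32 still matches:", same_crc)
```

The two flips add and subtract the same value in the 16-bit sum, so the checksum cannot tell the payloads apart, while the CRC can.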
3 comments

One of our favorite troubleshooting stories was "3% of TLS connections fail on this particular frontend IP address. HTTP works."

Turned out our cloud provider's networking gear had a bug that disabled ECC and there was a bit flip happening. Convincing the provider's support that we had found faulty hardware in their datacenter was an interesting journey.

I signed up to thank you for this insight. It's the kind of thing you would eventually find out on your own too, but having read it somewhere in words saves time when something wonky happens and the data supports the conclusion. There are people who would swear that the chips/hardware are not the issue (which is usually correct, but not always).

I wonder why somebody would use 204 bytes (1632 bits). Why not less, or why not more, e.g. for jumbo frames? Is there some data sheet or source that you would recommend?

Ah... Network hardware troubleshooting. Truly, it separates the frustration tolerant from the frustration intolerant. If not because of the challenge of having to take into account things most programmers treat as invisible, then because it usually involves at least one conversation with another network provider to check their stuff, which generally leads to the most passive-aggressive dodging of blame until you get that one operator whose one goal in life is to keep the Internet running correctly.