by mitchs
2220 days ago
Something worth knowing: TCP checksums should not be relied upon. If you aren't using TLS, your application needs to do its own checksums. A 16-bit ones' complement sum over a packet is not sufficient, especially given modern switch ASIC design. These chips take really fast signals off the fiber and turn them into slower, parallel signals within the ASIC. Often these internal buses are a nice even number of bytes wide, 204 in this example. When something is flawed in that slower internal pipeline, it hits the same bit position every 204 bytes. If it is borked enough to flip two bits within a packet, those flips will be in the same bit position a multiple of 204 bytes apart. Because 204 is a multiple of 2, both flips also land on the same bit position of the 16-bit words being summed. If one flips up and the other flips down, the sum is unchanged and the checksum passes! In my case it ended up corrupting data in a BGP session's TCP stream, executing the world's most confusing route hijack in our network.
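For anyone who wants to see the cancellation for themselves, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the checksum is computed over a bare payload (real TCP also covers a pseudo-header and the TCP header), the 1024-byte payload and the byte offset 10 are made up, and `internet_checksum`, `flip_bit`, and `BUS_WIDTH` are hypothetical names of mine. The only number taken from the story is the 204-byte bus width.

```python
import zlib

BUS_WIDTH = 204  # bytes; the internal bus width from the story above


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement sum over big-endian 16-bit words (RFC 1071 style)."""
    if len(data) % 2:
        data = bytes(data) + b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flip_bit(buf: bytearray, byte_index: int, bit: int) -> None:
    """Invert one bit in place."""
    buf[byte_index] ^= 1 << bit


# Made-up payload; force bit 3 to 0 at offset 10 and to 1 one bus width later.
payload = bytearray(i % 251 for i in range(1024))
payload[10] &= ~0x08 & 0xFF
payload[10 + BUS_WIDTH] |= 0x08

# Simulate the fault: same bit position, 204 bytes apart, opposite directions.
corrupted = bytearray(payload)
flip_bit(corrupted, 10, 3)              # 0 -> 1
flip_bit(corrupted, 10 + BUS_WIDTH, 3)  # 1 -> 0

same_checksum = internet_checksum(corrupted) == internet_checksum(payload)
same_crc32 = zlib.crc32(corrupted) == zlib.crc32(payload)

print("data changed:       ", corrupted != payload)
print("checksum unchanged: ", same_checksum)
print("CRC-32 unchanged:   ", same_crc32)
```

One word in the sum goes up by 0x0800 and another goes down by 0x0800, so the total (and therefore the checksum) is identical even though the data is not. A CRC-32 over the same bytes does notice the change, which is the kind of application-level check I mean when TLS isn't in the picture.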
It turned out our cloud provider's networking gear had a bug that disabled ECC, and a bit flip was happening. Convincing the provider's support that we had found faulty hardware in their datacenter was an interesting journey.