|
|
|
|
|
by lcw
1302 days ago
|
|
Agreed, one of the craziest bugs I had to deal with was we had a distributed system using lots of infrastructure. Said distributed system started having trouble communicating with random nodes and sub-systems. I spent 3 hard days finding a Linux kernel bug where the ARP cache was not removing least recently accessed network addresses. Normally, this wouldn't be a big deal for a typical network because few networks would fill up the default arp cache size. That was even true for ours except that we would slowly add and remove infrastructure over the course of a couple months until eventually the ARP cache would fill and remove the random network devices... It wasn't even our distributed application code... Some bugs take time to manifest themselves in very creative ways. |
|
Turns out we accidentally stretched one server VLAN too wide, to roughly 600 devices within one VLAN within one switch. The servers had more-or-less all-to-all traffic, and that was enough to generate so many ARP requests and replies that the switch's supervisor policer started dropping them at random, and after ten failed retries for one server the switch just gave up and dropped it from the ARP table.
Of course the control plane policer is global for the switch, so every device connected to the switch was susceptible, not just the ones in the overextended VLAN.