Hacker News
by ngbronson 4226 days ago
One of the reasons this took a long time to figure out is that it was a failure amplifier, so there was always a more typical network problem preceding it. Network failures in a data center cause lots of changes to the packets, because of retries, failover, and automatic load balancing, so there were a lot of trees to look at.
1 comment

That makes a lot more sense. It would have been nice to include some of the troubleshooting process so people can learn from that too. Thanks for sharing!