by pja 5125 days ago
It looks like the error recovery code wasn't well tested. Mind you, error recovery code in distributed systems is some of the hardest code to test effectively.

The thundering herd of recovery is especially difficult to cope with: your error recovery code can work just fine for normal outages but then fail completely when faced with just a few more components going dark.
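
To make the herd effect concrete, here is a rough sketch in Python. It is not taken from the system under discussion: the function names (`backoff_delay`, `peak_load`) and all the numbers (base delay, cap, bucket width, client count) are made up for illustration. It compares clients that all retry after the same fixed exponential delay with clients that use randomised ("full jitter") backoff, a commonly used mitigation, and counts how many retries land in the busiest time slot.

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full-jitter exponential backoff (hypothetical helper).

    Returns a delay drawn uniformly from [0, min(cap, base * 2**attempt)].
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def peak_load(num_clients, attempt, jitter, bucket=0.1, seed=0):
    """Return the largest number of retries that fall in one time bucket.

    Every client is assumed to have failed at the same instant, which is
    the situation after a shared component goes dark.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(num_clients):
        if jitter:
            delay = backoff_delay(attempt, rng=rng)
        else:
            # No jitter: every client computes the identical delay.
            delay = min(30.0, 0.5 * (2 ** attempt))
        slot = int(delay / bucket)
        counts[slot] = counts.get(slot, 0) + 1
    return max(counts.values())


if __name__ == "__main__":
    print("no jitter:", peak_load(1000, attempt=4, jitter=False))
    print("jitter:   ", peak_load(1000, attempt=4, jitter=True))
```

Without jitter all 1000 simulated clients retry in the same slot, so the recovering service sees the whole herd at once; with jitter the same retries are spread across the backoff window. This only models the timing of retries, not the cascading failures the comment describes, which is part of why this kind of code is so hard to test for real.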