by dlowe 4778 days ago
I'm not sure if I can do any better than "troubleshooting can be hard", frankly. The actual details are all tangled together in a way that resists summary.

Just tell customers that there were queue backlogs caused by slow git clones, which were exacerbated by server failures due to kernel panics and LVM snapshot problems. These were resolved, but MTU configuration changes made during troubleshooting caused further outages; later on, an unrelated bug in schejulur caused another outage.

However, all of these issues are now resolved, and your service is far more robust as a result.