by chris_wot 4772 days ago
Troubleshooting can be a bitch.

Could you add a tl;dr though?


I'm not sure if I can do any better than "troubleshooting can be hard", frankly. The actual details are all tangled together in a way that resists summary.
Just tell customers that there were queue backlogs caused by slow git clones, which were exacerbated by server failures due to kernel panics and LVM snapshot problems. These were resolved, but MTU configuration changes made during troubleshooting caused further outages; later, an unrelated bug in schejulur caused another outage.

However, all these issues are now resolved and your service is far more robust because of it.