by khaless 1998 days ago
While it's nice to get paged for, and look at, every 5xx error, it doesn't really scale all that well once you get past a certain point, particularly if your application is gracefully degrading.

That said, I love the wisdom in your comment: you find all sorts of super rare bugs, or conditions that could seriously affect performance or availability if they become more common (which they often do). Past a point, I've found that an approach which works well is to encourage engineers/operators to drive by metrics and pay close attention over time to p100s (max), as you've suggested with your 500 errors. Lots of goodies can be hidden behind them, just like you've found with the 500 errors.
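
To make the p100 point concrete, here's a rough sketch of what I mean. The data is made up (1000 requests per window, latencies in ms), and `percentile` and `summarize` are just illustrative helper names, not anything from a particular metrics library. It uses a simple nearest-rank percentile.

```python
import math

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def summarize(windows):
    """Return one (p50, p99, p100) tuple per window of latency samples."""
    summaries = []
    for samples in windows:
        ordered = sorted(samples)
        summaries.append(
            (percentile(ordered, 50), percentile(ordered, 99), ordered[-1])
        )
    return summaries

# Hypothetical data: latencies in milliseconds, 1000 requests per window.
quiet = [20] * 1000
one_outlier = [20] * 999 + [4000]  # a single rare, very slow request

result = summarize([quiet, one_outlier])
for p50, p99, p100 in result:
    print(f"p50={p50}ms p99={p99}ms p100={p100}ms")
```

In the second window the p50 and p99 look identical to the quiet one, and only the max moves. That's the kind of thing that stays invisible unless someone is watching p100 over time.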