by bazzargh 3973 days ago

This might work at low traffic levels, but there are better ways. In our internal error catcher, we've seen deploys that caused 20,000+ errors in under a minute. Even when we used similar techniques to mail ourselves errors on a low traffic site, it was important to introduce dead time: not alerting again for (say) 5 minutes after the first alert, so as not to flood our inbox.
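To make the dead time idea concrete, here is a minimal sketch in Python. The class name `AlertThrottle`, its `dead_time` parameter, and the injectable `clock` are all made up for illustration; this is not how our error catcher is actually written, just one way the idea could look.

```python
import time


class AlertThrottle:
    """Allow one alert, then suppress further alerts for `dead_time` seconds."""

    def __init__(self, dead_time=300.0, clock=time.monotonic):
        self.dead_time = dead_time  # seconds of silence after an alert (assumed 5 minutes)
        self.clock = clock          # injectable so the throttle can be tested without sleeping
        self.last_sent = None       # time of the last alert that was let through
        self.suppressed = 0         # total number of alerts swallowed so far

    def should_alert(self):
        """Return True if an alert may be sent now, False if we are in dead time."""
        now = self.clock()
        if self.last_sent is None or now - self.last_sent >= self.dead_time:
            self.last_sent = now
            return True
        self.suppressed += 1
        return False
```

The error handler would call `should_alert()` before mailing and skip the mail when it returns False. The `suppressed` counter could be included in the next alert so you still know how many errors you didn't hear about.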

A useful technique btw is to use a ring buffer to collect debug logging, and when an error occurs dump this buffer into the message along with the stacktrace. It gives you more context when you're not logging debug to disk, and is fast. (see eg http://www.exampler.com/writing/ring-buffer.pdf, https://logging.apache.org/log4j/1.2/apidocs/org/apache/log4...)
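A rough sketch of the ring buffer technique using Python's standard `logging` module, assuming a fixed-size `collections.deque` as the buffer. `RingBufferHandler` and `build_error_report` are hypothetical names for this example, not taken from either of the links above.

```python
import collections
import logging
import traceback


class RingBufferHandler(logging.Handler):
    """Keep only the most recent `capacity` formatted log lines in memory."""

    def __init__(self, capacity=100):
        super().__init__(level=logging.DEBUG)
        # A deque with maxlen silently drops the oldest entry when full.
        self.buffer = collections.deque(maxlen=capacity)

    def emit(self, record):
        self.buffer.append(self.format(record))

    def dump(self):
        """Return the buffered lines, oldest first, as one string."""
        return "\n".join(self.buffer)


def build_error_report(handler):
    """Call from inside an except block: stacktrace plus recent debug lines."""
    return traceback.format_exc() + "\n--- recent debug log ---\n" + handler.dump()
```

You'd attach the handler to a logger set to DEBUG, and in the `except` block that sends the error mail use `build_error_report(handler)` as the message body. Nothing is written to disk; the only cost per debug call is formatting the line and appending it to the deque.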

Services like https://www.pagerduty.com/ can contact you on more channels when problems happen, and also handle dead time etc. so they're not /re/ alerting you constantly. The Slack mobile client would have got every one of those messages as a push.

PagerDuty won't help with aggregating or exploring stacktraces though; for that there's e.g. https://airbrake.io or, for mobile apps, https://try.crashlytics.com/.

Finally there's also http://www.splunk.com/ for aggregating logs; you can build some quite complex queries from it and do alerting on the results (not quite as fancy as pagerduty, but functional).

There are many other tools in this space; it's worth looking around to see if there's any SaaS you can use or crib ideas from.

1 comment

bkucukguzel 3973 days ago

Thank you a lot for your comment. I am going to check these tools.