|
In a previous life as a full-stack engineer at a startup, this was my white whale. Logging, monitoring, and alerting were in such a state that signal quality was low and only indirect observation of the system was possible, since the logging was borderline useless. The result was multiple pages per night, each one turning into a scavenger hunt because it was nigh impossible to even identify which playbook to run. For example, the web application crashing was logged as a DEBUG statement, but starting was logged at the ERROR level. This was clearly done at some point because DEBUG generated far too much log output with millions of active users, but some engineer wanted to know that the app had started. Gross. I solved this by doing a couple of things. The first was to define standards for log levels, for correlating log statements with each other for a given request, and for the level of context a "proper" log statement should provide. For example, FATAL meant there's no way anything can work properly. These were pretty rare, but incorrect configuration values were a common culprit. ERROR indicated something going wrong, possibly transiently. One every now and then is not a big deal and can wait until later, but a rapid accumulation could mean something more serious is going on. INFO contained information about the state of the system, such as general measures of activity and other signals that the system was working as expected. Most of our metrics capture was instrumented off these statements. As for the messages themselves, we rapidly improved their quality. For something like the aforementioned configuration error, the system initially just spat out an "Unexpected error" and a module name. 
The first improvement said something like "invalid configuration value", and we finally landed on a message that stated the value was incorrect, identified which configuration value was wrong, and included a code that referenced documentation and an escalation owner. When all was said and done, we'd reduced our downtime from hours per year to less than 5 minutes, eliminated over 95% of our pages, and reduced escalations to Engineering from several days per week to a level where it was hard to remember the last one. As the head of Engineering, I had to fight an uphill battle against the product and sales teams for almost a year to make all of this happen, but I was fully vindicated when we were acquired and our operational maturity was lauded during the due diligence process. |
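
To make the standards above a bit more concrete, here is a minimal sketch of what they could look like with Python's standard `logging` module. This is not the system described above, just an illustration under assumptions: the logger name `webapp`, the request ID `req-123`, the key `DB_POOL_SIZE`, the code `CFG-001`, and the owner `platform-team` are all hypothetical. Python has no separate FATAL level name (`logging.FATAL` is an alias for `logging.CRITICAL`), so the FATAL case shows up as `CRITICAL` in the output.

```python
import contextvars
import io
import logging

# Hypothetical: holds the ID of the request currently being handled.
request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every record so lines can be correlated."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True


def build_logger(stream):
    """Build a logger that writes level, request ID, and message to `stream`."""
    logger = logging.getLogger("webapp")  # hypothetical logger name
    logger.setLevel(logging.INFO)  # DEBUG stays off in production
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def config_error_message(key, value, code, owner):
    """Say what is wrong, which setting it is, and where to look or whom to ask."""
    return (
        f"[{code}] invalid configuration value for {key!r}: {value!r} "
        f"(docs: see {code}; escalate to: {owner})"
    )


stream = io.StringIO()
log = build_logger(stream)

# INFO: the system is behaving as expected.
log.info("application started")

# ERROR: something went wrong, possibly transiently, within one request.
request_id.set("req-123")
log.error("upstream timeout, will retry")

# FATAL (CRITICAL in Python): nothing can work until this is fixed.
log.critical(config_error_message("DB_POOL_SIZE", "-1", "CFG-001", "platform-team"))

output = stream.getvalue()
print(output, end="")
```

The point of the filter is that every line emitted while handling a request carries the same `request_id`, so one search pulls up the whole story of that request, and the error code gives whoever is on call a direct pointer to the right playbook.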