

by fatnoah 1164 days ago

In a previous life as a full-stack Engineer at a startup, this was my white whale. The state of logging, monitoring, and alerting was such that signal quality was low, and only indirect observations of the system were possible since the logging was borderline useless. The result was multiple pages per night, with each one resulting in a scavenger hunt because signal was so low that it was nigh impossible to even identify what playbook to run.

For example, the web application crashing was logged as a DEBUG statement, but starting was logged at an ERROR level. This was clearly done at some point because DEBUG generated far too much log info w/millions of active users, but some Engineer wanted to know that the app started. Gross.

I solved for this by doing a couple of things. The first was to define standards for log levels, add the ability to correlate log statements with each other for a given request, and define the level of context a "proper" log statement should provide.
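To make the correlation part concrete, here is a minimal sketch in Python, assuming the standard `logging` and `contextvars` modules. The names `request_id`, `RequestIdFilter`, `build_logger`, `handle_request`, and `demo` are made up for illustration and are not from the system described above.

```python
import contextvars
import io
import logging
import uuid

# Hypothetical: holds the ID of the request currently being handled.
request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Stamp every log record with the current request ID."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True


def build_logger(stream):
    """Return a logger whose lines all carry req=<id> for correlation."""
    logger = logging.getLogger("app.correlated")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s req=%(request_id)s %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, user, rid=None):
    """Hypothetical request handler: set the ID once, log freely after."""
    token = request_id.set(rid or uuid.uuid4().hex[:8])
    try:
        logger.info("request started user=%s", user)
        logger.info("request finished user=%s", user)
    finally:
        request_id.reset(token)


def demo():
    """Capture the log lines for one request so they can be inspected."""
    buf = io.StringIO()
    logger = build_logger(buf)
    handle_request(logger, "alice", rid="abc123")
    return buf.getvalue().splitlines()
```

With something like this in place, searching the logs for one `req=` value pulls up every statement emitted for that request, which is the property that makes a page less of a scavenger hunt.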

For example, FATAL = there's no way anything can work properly. These are pretty rare, but incorrect configuration values were a common culprit. ERROR indicates something, possibly transient, going wrong. An occasional one is not a big deal and can wait until later, but a rapid accumulation could mean something more serious is going on. INFO contained information about the state of the system, such as general measures of activity and other signals to indicate the system is working as expected. Most of our metrics capture was instrumented based off these statements.
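The "rapid accumulation" rule for ERROR can be sketched as a sliding-window counter. This is only an illustration in Python: the class name `ErrorBurstDetector` and the thresholds are assumptions, not the values or mechanism the original system used.

```python
from collections import deque


class ErrorBurstDetector:
    """Flag a burst when more than max_errors occur within window_seconds.

    Hypothetical helper: a single ERROR is tolerated, a pile-up is not.
    """

    def __init__(self, max_errors=5, window_seconds=60.0):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_error(self, now):
        """Record one ERROR at time `now` (seconds); return True on a burst."""
        self._timestamps.append(now)
        cutoff = now - self.window_seconds
        # Drop errors that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors
```

Isolated errors spread over minutes never trip the detector, while the same number packed into a few seconds does, so only the second case would page anyone.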

We also rapidly improved the quality of the messages themselves. For something like the aforementioned configuration error, the system initially just spat out an "Unexpected error" and a module name. The first improvement then stated something like "invalid configuration value" and finally we ended up on a message that stated the value was incorrect, identified which configuration value was wrong, and included a code that referenced the documentation and the escalation owner.
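As a rough illustration of that final message shape, here is a sketch in Python. The exception class `ConfigValueError`, the code `CFG-001`, the owner `platform-oncall`, and the `port` setting are all invented for the example; the original post does not say what the real codes or owners looked like.

```python
class ConfigValueError(ValueError):
    """Configuration error that names the key, the bad value, a lookup
    code for the documentation, and who to escalate to."""

    def __init__(self, key, value, expected, code, owner):
        self.key = key
        self.value = value
        self.code = code
        self.owner = owner
        super().__init__(
            f"[{code}] invalid configuration value for {key!r}: "
            f"got {value!r}, expected {expected}; "
            f"see documentation entry {code}; escalate to {owner}"
        )


def validate_port(config):
    """Hypothetical check: return the port as an int or raise a
    ConfigValueError that says exactly what is wrong."""
    raw = config.get("port")
    try:
        port = int(raw)
    except (TypeError, ValueError):
        port = None
    if port is None or not (1 <= port <= 65535):
        raise ConfigValueError(
            key="port",
            value=raw,
            expected="an integer from 1 to 65535",
            code="CFG-001",          # assumed documentation code
            owner="platform-oncall",  # assumed escalation owner
        )
    return port
```

Compared with "Unexpected error", a message like this tells whoever is on call which value to fix and where to look, without opening the code.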

When all was said and done, we'd reduced our downtime from hours per year to less than 5 minutes, eliminated over 95% of our pages, and reduced escalations to Engineering from several days per week to a level where it was hard to remember the last one.

As the head of Engineering, I had to fight an uphill battle against the product & sales team for almost a year to make all of this happen, but I was fully vindicated when we were acquired and our operational maturity was lauded during the due diligence process.

2 comments

peteradio 1164 days ago

You know all that work was worth it when you get a good lauding.


dgunay 1164 days ago

Going through something like this as a SWE at a startup. Lots of noise in our alerts and logging, so alert fatigue is a real problem. Do you have any advice on navigating this scenario (esp. negotiating with product to get monitoring and ops in a usable state)?


raldi 1164 days ago

Sure, just give your manager a copy of the bible: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa...


dgunay 1164 days ago

Thanks, this was a very enlightening read. Getting product on board with the labor involved in implementing this is going to be a different story though.


raldi 1164 days ago

Another good piece about negotiating with Product is written up here:

https://sre.google/sre-book/introduction/#:~:text=Pursuing%2...

Ultimately, it's Product's job to decide how they want to balance reliability and feature-shipping speed. Work with them to define an SLO (like, in 99.995% of five-minute timeslices of any given month, 99% of all queries will complete within 250msec) and then graph how well you're doing when it comes to hitting it.

If you're failing to keep things above that line, Product either needs to accept lower reliability standards or invest engineering time in improving reliability. Again, it's Product's call to make. If they do want to invest in reliability, though, that's when you get to present your wish list, work out an agreement on its ranking, and find time to get the work done, even if it means slowing down the rate at which new features are shipped.
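The example SLO above can be measured with a few lines of code. This is a sketch in Python under stated assumptions: the function name `slo_attainment` is made up, queries arrive as `(timestamp_seconds, latency_ms)` pairs, and timeslices with no traffic are simply ignored (you could equally count them as good).

```python
from collections import defaultdict


def slo_attainment(queries, slice_seconds=300, latency_ms=250.0,
                   per_slice_target=0.99):
    """Return the fraction of timeslices that met the per-slice target.

    A slice is "good" when at least per_slice_target of its queries
    completed within latency_ms. Returns None if there were no queries.
    Assumption: slices with no queries are left out of the calculation.
    """
    fast = defaultdict(int)
    total = defaultdict(int)
    for timestamp, latency in queries:
        slice_index = int(timestamp // slice_seconds)
        total[slice_index] += 1
        if latency <= latency_ms:
            fast[slice_index] += 1
    if not total:
        return None
    good = sum(
        1 for s in total if fast[s] / total[s] >= per_slice_target
    )
    return good / len(total)


def meets_slo(queries, monthly_target=0.99995):
    """Compare attainment over the supplied queries with the target."""
    attainment = slo_attainment(queries)
    return attainment is not None and attainment >= monthly_target
```

Run over a month of query data, the attainment number is the thing to graph against the 99.995% line when you sit down with Product.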


hallway_monitor 1164 days ago

You may have luck if you frame it in terms of an investment. Spend the time now to fix your alerts, add playbooks, improve process, because you immediately start enjoying the benefits. Less time spent on support means higher velocity. The longer you wait, the more engineering time you've wasted. It just takes a little patience up front as well as product and engineering collaborating.
