by natdempk 2706 days ago
jerf answered observability well in another reply to this comment.

As for reliability, monitoring, and error handling, I've heard good things about the Google SRE book: https://landing.google.com/sre/books/

I haven't read it personally, but others have recommended it, and from a brief look, the advice there lines up with what I've experienced in practice.