Hacker News
by chrisdsaldivar 2705 days ago
Yes: reliability, monitoring, and error handling are the kinds of things I’m looking for more information on. Do you have any recommendations for further reading on these topics? I should have clarified that my question was geared towards important concepts that are agnostic of languages, frameworks, etc. This is a great list of further reading, thank you.

Also, what does observability mean in this context?

3 comments

"Also, what does observability mean in this context?"

Something went wrong, and now your site is serving 500 server errors to everybody at the rate of 25,000 per minute. The ops team already tried "just reboot it" and it didn't help. How are you going to figure out what is going on and fix it?

It's (mostly) too late to add anything, so all you've got is the logs you already had, the metrics you already had, etc. That's the "observable" stuff in a system. There's an art to recording what you need to know while at the same time not recording so much that you can't find what you need in the mess.
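To make "recording what you need to know" a bit more concrete, here is a minimal sketch using only Python's standard `logging` module. The logger name `checkout` and the fields `request_id`, `status`, and `duration_ms` are made up for illustration; the idea is just that one searchable JSON line per event, with a small fixed set of fields, tends to be easier to dig through during an outage than free-form text.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line so logs stay searchable."""

    # Hypothetical field names; pick the handful you would actually query on.
    EXTRA_FIELDS = ("request_id", "status", "duration_ms")

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the log record.
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def make_logger(stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # assumed service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stands in for a log file or stderr
log = make_logger(buffer)
log.error(
    "upstream call failed",
    extra={"request_id": "req-123", "status": 500, "duration_ms": 842},
)
print(buffer.getvalue(), end="")
```

With something like this in place before the outage, "show me every 500 and its request id" becomes a filter on `status` rather than a grep through prose.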

(The "mostly" is that if you have a good enough setup, you might be able to bring up a new system and route some very small fraction of traffic to it to examine it more intensely in real time with a debugger or something. In my experience, though, on the occasions I've had the opportunity to try this, it's never been a problem that would manifest on a new system receiving a vanishing fraction of a percent of the scale of a production box. But maybe you'll get lucky.)

You certainly want to do everything you can to not be in that mess in the first place, but it won't be enough. You need a system sufficiently observable that you can find the problem and find some sort of solution.

Oh, thank you. I didn't know that was referred to as "observability"; I thought it was just logging. This article from Etsy's engineering blog [1] was part of the inspiration for this question. Funnily enough, when I googled "Etsy engineering logging", the 5th result was for a position on Etsy's observability team.

[1] https://codeascraft.com/2011/02/15/measure-anything-measure-...

I think of observability as a triad:

- logging (e.g. Splunk, Sumo Logic, LogDNA)

- metrics (e.g. Prometheus, Datadog, Grafana)

- tracing (e.g. Lightstep, New Relic, Zipkin)

As mentioned above, observability is the data collected about a system.
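As a rough illustration of how the three pieces differ, here is a toy sketch in plain Python with no real vendor client involved. The in-memory `metrics`, `spans`, and `log_lines` stores and the `handle_request` function are all invented for the example; a real setup would ship this data to tools like the ones listed above.

```python
import time
import uuid
from collections import Counter
from contextlib import contextmanager

metrics = Counter()  # metrics: cheap aggregates you can graph and alert on
spans = []           # tracing: timed units of work tied to one request
log_lines = []       # logging: discrete events with detail


@contextmanager
def span(name, trace_id):
    """Record how long a named step took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(path, fail=False):
    """Hypothetical request handler that emits all three signal types."""
    trace_id = uuid.uuid4().hex  # shared id links the log line to its spans
    status = 200
    with span("handle_request", trace_id):
        try:
            with span("db_query", trace_id):
                if fail:
                    raise RuntimeError("db timeout")
        except RuntimeError as exc:
            status = 500
            log_lines.append(f"trace={trace_id} path={path} error={exc}")
    metrics[f"http_responses_total.{status}"] += 1
    return status


handle_request("/cart")
handle_request("/checkout", fail=True)
```

The metric tells you that 500s are happening and how often, the log line tells you what went wrong for one request, and the spans sharing its `trace_id` tell you where the time went.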

When it comes to "measure everything", I've found services whose clients already grok popular frameworks to be a godsend. We use New Relic, and its ability to automatically instrument all REST APIs and DB transactions is delightful. I could not imagine going back to having to do it manually or guess what information might be useful later.
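For a sense of what that kind of instrumentation does under the hood, here is a hand-rolled sketch of the general idea: wrap each call and record counts, errors, and latency without touching the function body. This is not New Relic's API or implementation; `instrumented`, `call_stats`, and `get_user` are hypothetical names made up for the example.

```python
import functools
import time
from collections import defaultdict

# Per-function stats; a real agent would export these instead of keeping them in memory.
call_stats = defaultdict(lambda: {"calls": 0, "errors": 0, "total_s": 0.0})


def instrumented(func):
    """Wrap a function so every call records count, errors, and elapsed time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        stats = call_stats[func.__name__]
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            stats["errors"] += 1
            raise  # instrumentation should never swallow the error
        finally:
            stats["calls"] += 1
            stats["total_s"] += time.perf_counter() - start
    return wrapper


@instrumented
def get_user(user_id):
    """Stand-in for a REST handler or DB call."""
    if user_id < 0:
        raise ValueError("invalid user id")
    return {"id": user_id}


get_user(1)
try:
    get_user(-1)
except ValueError:
    pass

print(dict(call_stats))
```

Doing this by hand for every endpoint and query is the tedious part that an agent with framework support takes off your plate.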
You might want to look into honeycomb.io and follow Charity Majors on Twitter. Heck, just follow Charity anyway - she's a genius.
jerf answered observability well in another reply to this comment.

As for reliability, monitoring, and error handling, I've heard good things about the Google SRE book: https://landing.google.com/sre/books/

I haven't read it personally, but I've heard good things from others, and from looking over it briefly, the advice there lines up with what I've experienced in practice.