by ogjunkyard 1681 days ago
I enjoyed the article a fair bit!

I was wondering, how should one get started with observability and implementing it? Are there specific books/courses/talks you'd recommend?

The reason I ask is because I've never been directly exposed to good observability at any of the companies I've worked at for a handful of reasons. It mostly boils down to the fact that I'm a DevOps engineer, so building observability is a set-up-and-keep-running sort of deal for other teams, not a useful-for-my-applications thing that I'm going to be working with often. Teams let us know "Splunk is down", "I can't reach Kibana", or "Looks like disk space is filling up" and that's about it after it's been initially set up.

There's a whole host of questions I'd ideally like to answer, but a lot of it boils down to the fact that I don't know what I don't know and I'd suggest assuming I know nothing over assuming I know something because I know the word. Questions I'd like to be able to answer are:

- What makes a good log?

- What is a trace? Why is it useful? How does it help me debug issues faster?

- How do you increase observability for loosely-coupled microservice systems?

- How do you observe multi-threaded applications?

- ... and I'm sure there are a whole bunch more.

2 comments

Urgh, yes. This is a really difficult place to be coming from, and I totally feel your pain!

My background was working as an SRE at GoCardless, starting when the company was around 30 people and leaving at around 700. During that time we did the whole "oh crap, wtf is observability" thing, which coincided with a big push in the industry to define the term, and I worked on the team (Observability Working Group) that tried rolling out these practices.

The truth is this is much easier if you have someone with you who knows what good looks like, though I was in your position at GC and it's possible to learn it from first principles.

If you're doing this, the best advice I can give you is to think really critically about _why_ you want observability.

Usually it's "when something goes wrong, I want to be able to understand what led to it, and what was going on at that time". If that's the case, you can't make a wrong step if it improves your ability to understand that, even if what you do is simple.

At GC we began with logs, as everyone was familiar with them. We encouraged people to start thinking about logs as structured data, so drop the "Posted message to Slack" log line and go for something like:

```
{
  event: "slack_message.posted",
  slack_channel_id: "CH123",
  slack_user_id: "US123",
  etc...
}
```
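As a rough sketch, emitting a log line like that from Python might look like the following. It uses only the standard library; the `log_event` helper and the `ts` field are hypothetical names for illustration, not anything GC is known to have used.

```python
import json
import sys
import time


def log_event(event, stream=sys.stdout, **fields):
    """Write one JSON object per line describing a single event."""
    # "ts" is an assumed field name; most log pipelines want some timestamp.
    record = {"ts": time.time(), "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    stream.write(line + "\n")
    return line


# Instead of logging the string "Posted message to Slack":
log_event(
    "slack_message.posted",
    slack_channel_id="CH123",
    slack_user_id="US123",
)
```

One object per line keeps the output easy to ship to whatever log store you use, and every field becomes something you can filter or group on later.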

When you get your logs looking like that, you can set up something like Grafana to expose visualisations that are built from your logs. We were using ElasticSearch for log storage, which is quite simple to build graphs on top of.
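To make that concrete: once every line is a JSON object, a graph is just an aggregation over fields. Here is a rough sketch in plain Python of the per-minute count a dashboard panel might plot. This is not the ElasticSearch or Grafana API, and the sample lines, the `ts` field, and the `count_events_per_minute` name are all made up for illustration.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def count_events_per_minute(lines, event):
    """Count log lines with the given event name, bucketed by UTC minute."""
    buckets = Counter()
    for line in lines:
        record = json.loads(line)
        if record.get("event") != event:
            continue
        # Assumes "ts" holds seconds since the Unix epoch.
        ts = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
        buckets[ts.strftime("%Y-%m-%dT%H:%M")] += 1
    return dict(buckets)


sample = [
    '{"ts": 0, "event": "slack_message.posted", "slack_channel_id": "CH123"}',
    '{"ts": 30, "event": "slack_message.posted", "slack_channel_id": "CH123"}',
    '{"ts": 65, "event": "slack_message.posted", "slack_channel_id": "CH456"}',
    '{"ts": 70, "event": "user.signed_in"}',
]
print(count_events_per_minute(sample, "slack_message.posted"))
```

With the old free-text log line you would have to regex-match strings to get the same numbers, which is why the structured format matters.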

Visualisations are really compelling, and help you persuade people it's worthwhile to consider this stuff.

Beyond structured logging, you'd want to look into time-series metrics (Prometheus) which can help you monitor things in a bit more real-time, then traces if you want that type of insight.

I've often compared observability to testing, in terms of how you should think about it/use it. You'll find a load of dev teams who think testing is a waste of time, but most high performance teams won't ship without tests.

They'll say testing doesn't just help catch errors, it helps them build faster, due to the confidence it gives them.

You'll know your org has adopted observability when they feel that way about instrumenting their code, and it's second nature to write logs/traces/metrics into their software.

Not sure I have any reference links in mind just yet, but I'll give it a think.

Thanks so much for this response!

I've been thinking on your response for probably over an hour as I've been going about my day, and the thing that is sticking out to me is your directive to think critically about WHY I want observability. I think I figured out the motivation for why I'm looking into all of this stuff.

I have a side business I'm working on that causes me to think about the customer experience a lot since it's a fully self-service, no-touch product where I'm not actively engaged in the sales, onboarding, etc. experience a new user has. When someone does have an issue, I want to be able to help them accomplish what they are trying to do as quickly as possible.

I recently had a user/friend who was trying to get something set up in the application I'm building. The only reason I knew he had an issue was because he reached out to me. Luckily, when I finally saw his message 4-5 hours later, he was around and able to work with me on troubleshooting his issue. It took me a bit to troubleshoot exactly what was going on and the friend was very patient/helpful the entire time. I remember having him try to initiate his request probably a dozen or so times as I worked through my application and teased out the root cause of his problem. Ultimately, this led to me building better error messages into my application to address this specific point, but if there's a way to get ahead of the user issue whack-a-mole game, I'm all for it.

Instead of him reaching out to me and us troubleshooting this issue together in real time, it would have been more helpful to simply have an Error Code and Request ID. This would allow me to tell him, "I dug into this and found out what's going on. Here's exactly what the issue is. Do X, Y, and Z to get this working."
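A minimal sketch of that idea in Python, assuming nothing about the real application: `AppError`, `handle_request`, `create_widget`, and the `WIDGET_NAME_MISSING` code are all hypothetical names. Each request gets one ID, the failure is logged with that ID and an error code, and the user gets back just the code and the ID to pass along.

```python
import json
import uuid


class AppError(Exception):
    """An error with a stable code that is safe to show to a user."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def handle_request(handler, payload, log=print):
    """Run handler(payload), tagging its log line with a fresh request ID."""
    request_id = uuid.uuid4().hex
    try:
        result = handler(payload)
    except AppError as err:
        # The detailed cause goes to the logs, keyed by request_id.
        log(json.dumps({
            "event": "request.failed",
            "request_id": request_id,
            "error_code": err.code,
            "detail": str(err),
        }))
        # The user only sees the code and the ID.
        return {"ok": False, "error_code": err.code, "request_id": request_id}
    log(json.dumps({"event": "request.succeeded", "request_id": request_id}))
    return {"ok": True, "request_id": request_id, "result": result}


def create_widget(payload):
    # Hypothetical handler, used here only to trigger an error.
    if "name" not in payload:
        raise AppError("WIDGET_NAME_MISSING", "payload has no 'name' field")
    return {"name": payload["name"]}


print(handle_request(create_widget, {}))
```

Searching the logs for the request ID then shows what happened on that one request, without asking the user to retry a dozen times.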

Other points that particularly resonate with me, although I may not consciously know why, are:

- JSON-structured logging

- Visualizations could help sell the idea of observability at $DAYJOB (but no clue what would make for a good graph/diagram/etc.)

- High-functioning teams want observability like high-functioning teams want automated testing.