Hacker News new | ask | show | jobs
by fishtoaster 682 days ago
For what it's worth, I found it almost trivial to set up open telemetry and point it honeycomb. It took me an afternoon about a month ago for a medium-sized python web-app. I've found that I can replace a lot of tooling and manual work needed in the past. At previous startups it's usually like

1. Set up basic logging (now I just use otel events)

2. Make it structured logging (Get that for free with otel events)

3. Add request contexts that's sent along with each log (Also free with otel)

4. Manually set up tracing ids in my codebase and configure it in my tooling (all free with otel spans)

Really, I was expecting to wind up having to get really into the new observability philosophy to get value out of it, but I found myself really loving this setup with minimal work and minimal koolade-drinking. I'll probably do something like this over "logs, request context, metrics, and alarms" at future startups.

1 comments

I've currently done this, and I'm seriously considering undoing it in favor of some other logging solution. My biggest reason: OpenTelemetry fundamentally doesn't handle events that aren't part of a span, and doesn't handle spans that don't close. So, if you crash, you don't get telemetry to help you debug the crash.

I wish "span start" and "span end" were just independent events, and OTel tools handled and presented unfinished spans or events that don't appear within a span.

Isn’t the problem here that your code is crashing and you’re relying on the wrong tool to help you solve that ?
Logging solves this problem. If OTel and observability is attempting to position itself as a better alternative to logging, it needs to solve the problems that logging already solves. I'm not going to use completely separate tools for logging and observability.

Also, "crash" here doesn't necessarily mean "segfault" or equivalent. It can also mean "hang and not finish (and thus not end the span)", or "have a network issue that breaks the ability to submit observability data" (but after an event occurred, which could have been submitted if OTel didn't wait for spans to end first). There are any number of reasons why a span might start but not finish, most of which are bugs, and OTel and tools built upon it provide zero help when debugging those.

OTel logs are just your existing logs, though. If you have a way to say "whoopsie it hung" then this doesn't need to be tied to a trace at all. The only tying to a trace that occurs is when there's active span/trace in context, at which point the SDK or agent you use will wrap the log body in that span/trace ID. Export of logs is independent of trace export and will be in separate batches.

Edit: I see you're a major Rust user! That perhaps changes things. Most users of OTel are in Java, .NET, Node, Python, and Go. OTel is nowhere near as developed in Rust as it is for these languages. So I don't doubt you've run into issues with OTel for your purposes.

Can you give an example of an event that's not part of a span / trace?
Unhandled exceptions is a pretty normal one. You get kicked out to your app's topmost level and you lost your span. My wishlist to solve this (and I actually wrote an implementation in Python which leans heavily on reflection) is to be able to attach arbitrary data to stack frames and exceptions when they occur merge all the data top-down and send it up to your handler.

Signal handlers are another one and are a whole other beast simply because they're completely devoid of context.

Two good examples - thank you.

They're icky (as language design / practices) to me precisely because you end up executing context-free code. But I'd probably also just start a new trace in my signal handler / exception handler tagged with "shrug"...

Can't you close the span on an exception?
See https://news.ycombinator.com/item?id=41205665 for more details.

And even in the case of an actual crash, that doesn't necessarily mean the application is in a state to successfully submit additional OTel data.