by shipit1999 882 days ago
OpenTelemetry is a great concept, but in my experience not quite there yet. Docs especially fall into the common trap of handling the happy path hello world quickstarts, then become increasingly useless as you want to get beyond that to real life use cases. Given the inherent tradeoff of complexity that comes from trying to unify different approaches around one standard, sometimes it seems like things that should be simple are more difficult than they should be. I'm sure it will keep improving.

> Docs especially fall into the common trap of handling the happy path hello world quickstarts, then become increasingly useless as you want to get beyond that to real life use cases.

Yeah, Java is what I'm most familiar with. The "Getting Started" shows how to do some basic manual instrumentation and collect the output with curl. Then the "Next Steps" are just random things with no guidance about why I would or wouldn't choose any of them for my next step.

But, ok, I choose "Automatic Instrumentation", that sounds promising. And it actually is really easy to set up auto instrumentation. But then at the end it says

> After you have automatic instrumentation configured for your app or service, you might want to annotate selected methods or add manual instrumentation to collect custom telemetry data.

Uh... no... after I have automatic instrumentation enabled I want to do something with the output

The two major flaws in the docs seem to be

1. The common failure of docs to explain to users why they might choose one thing or another. "If you want to do x.. If you want to do y.." what if I don't know?

2. Because otel is agnostic to the consumer of the output, there's very little in the way of explaining how to get value out of what otel produces. To connect the dots, you really need to use the docs of your observability tool. Which I understand, but then most of them have their own setup directions because they want some extra fields included in the data, or they have their own fork, so not everything in the otel docs is actually usable.

I'm not sure what the answer is. It's not like I expect otel to document how to build a dashboard in Grafana. And a lot of frustration I've experienced has been with the observability tools themselves. But at the same time, I always feel like the otel docs just don't get you anywhere close to getting value out of the library. Which is a shame, because turning on auto-instrumentation and seeing all your traces with literally no extra work is a magical moment.

> 1. The common failure of docs to explain to users why they might choose one thing or another. "If you want to do x.. If you want to do y.." what if I don't know?

Observability docs in general struggle with this. So many data sources can emit so many types of metrics in so many formats, and every tool makes this impossible promise of consolidating it all into one space seamlessly. But tools like Grafana pride themselves so much on visualizing _anything_ that they paint themselves into a corner where they can't be prescriptive about common uses or methods without excluding or confusing others.

So a lot of the prescriptive answers to "what if I don't know?" get chucked onto account and support teams of commercial vendors, because the docs can't anticipate every possible context in which an observability tool will get deployed. Each solution ends up being custom tailored and poorly portable to anyone else's, often not even to other customers with the same data sources and goals at the same scale due to wacky labelling differences or legacy requirements or some internal stakeholder demand.

More narrowly focused tools don't have as many of these problems, but not many organizations want narrowly focused observability tools. (Lots of _people_ do, but orgs don't want to pay out deals to multiple vendors for what looks like different flavors of the same result. And hey look it's Grafana Cloud or Datadog or whatever, it can do _anything_, so you devs and also bizops and SRE and IT and hey sales wants a dashboard too and so does the company cafeteria, why not, you all can just use this one tool and we just deal with one bill with a volume discount, right? Right??)

Smarter tools don't have as many of these problems: they paper over the docs' limitations by being better able to anticipate or surface connections between data sources, metrics, logs, traces, events, etc., and they do so with better interfaces. But especially for high-cardinality data, the usability of those tools either seems to fall apart or their companies charge Datadog-sized invoices.

Are there narrowly focused tools in the observability space even?

I was shopping for one after being outside of this field for a while, and they all do the 101 features and the kitchen sink model, which adds to the complexity. Datadog, Grafana, but also the open source ones like SigNoz itself.

Ages ago it was all about metrics, today it's metrics traces logs APM alerting exceptions and a dozen other acronyms, on top of the protocols (statsd, Prometheus, OpenTelemetry), paired with crazy complicated yet unwieldy graph building UIs. Let's not even talk about pricing models. The entire business model is based around having one more checkmark in the feature list than the competition. The wire format (OpenTelemetry) has never been the pain point in this space.

For a moment, I seriously considered just going back to the 2000s and using RRDtool.

Most new observability tools start narrow but every economic incentive is to expand. Which makes sense, really, because most production systems people have are complicated as all hell and have tons of different needs. Some tools are better than others at containing the chaos -- I will humbly submit that the one I work for, Honeycomb, is one of the best at doing this -- but support for several telemetry signals, visualization tools, alerting systems, dashboarding systems, etc. are all what people eventually ask for as they roll out observability to more of their production systems.

Put differently, when you have sufficient observability of your entire system, you now have a complete abstraction of that system represented in some other UI and data streams. There's just no way out of the fact that for larger systems, this will be complicated, and the tools that can represent this reality must also be complex.

Hmmm... Yeah, I set up OpenTelemetry for a couple of personal projects this year and was pleased with the ease of setup, but by and large I knew what I was doing. Specifically, I had my application, I had Grafana, and I wanted to get traces from A to B.

Looking at the docs again through the eyes of a newcomer, if you don't already have a destination in mind they don't really help you. It's a little tricky because my setup with Grafana will be somewhat different (but similar) from someone using Honeycomb or SigNoz or what have you, but even just having a "want to visualize your data? Check out the list of compatible vendors" with a link in that direction would probably go a long way.

By comparison, I wanted to use OpenTelemetry for a series of projects, but could find absolutely no useful documentation on how to do anything other than "send data from a webapp to a server / other cloud service that some vendor wants to sell you".

All I wanted to do was instrument an application and write its telemetry data to a file in a standard way, and have some story regarding combining metrics, traces, and logs as necessary. Ideally this would use minimal system resources when idle. That's it.
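
For illustration, here is a rough sketch of what "instrument an application and write its telemetry to a file" could look like using only the Python standard library. This is not the OpenTelemetry SDK and not the OTLP file format: the class name `FileSpanWriter`, the helper `read_spans`, and the record fields (`trace_id`, `span_id`, `name`, `attributes`, `start_ns`, `end_ns`) are all assumptions made up for this example. A shared `trace_id` is one possible key for tying spans, logs, and metrics together later.

```python
import json
import time
import uuid
from contextlib import contextmanager


class FileSpanWriter:
    """Appends one JSON object per finished span to a local file (JSON Lines).

    Hypothetical helper for illustration; not part of any OpenTelemetry package.
    """

    def __init__(self, path):
        self.path = path

    @contextmanager
    def span(self, name, trace_id=None, **attributes):
        record = {
            "trace_id": trace_id or uuid.uuid4().hex,  # reuse to link related spans
            "span_id": uuid.uuid4().hex[:16],
            "name": name,
            "attributes": attributes,
            "start_ns": time.time_ns(),
        }
        try:
            yield record
        finally:
            record["end_ns"] = time.time_ns()
            # Open, append, close per span: no background thread and
            # no file handle held open while the application is idle.
            with open(self.path, "a", encoding="utf-8") as f:
                f.write(json.dumps(record) + "\n")


def read_spans(path):
    """Load every span record back from the JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Usage would be something like `with writer.span("load-config", user="alice") as s: ...`, passing `trace_id=s["trace_id"]` to later spans that belong to the same operation.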

It doesn't read from files unfortunately, but https://openobserve.ai/ is very easy to set up locally (single binary) and send otel logs/metrics/traces to.

Here's how I run it locally for my little shovel project - https://github.com/bbkane/shovel#run-the-webapp-locally-with... .

Also linked from that README is an Ansible playbook to start OpenObserve as a systems service on a Linux VM.

Alternatively, see the shovel codebase I linked above for a "stdout" TracerProvider. You could do something like that to save to a file, and then use a tool to prettify the JSON. I have a small script to format json logs at https://github.com/bbkane/dotfiles/blob/2df9af5a9bbb40f2e101...
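
As a sketch of the "prettify the JSON" step (this is not the script linked above, whose contents aren't shown here), something like the following would do, assuming the exporter writes one JSON object per line. The function name `prettify_json_lines` is made up for this example, and lines that aren't valid JSON are passed through unchanged.

```python
import json


def prettify_json_lines(lines, indent=2):
    """Yield each input line pretty-printed if it is JSON, unchanged otherwise."""
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            continue  # skip blank lines
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            # Not JSON (e.g. a stray print from the app): pass it through as-is.
            yield text
        else:
            yield json.dumps(parsed, indent=indent, sort_keys=True)
```

To use it on a saved file, iterate over the open file object and print each result, e.g. `for pretty in prettify_json_lines(open("spans.jsonl")): print(pretty)` (the file name is just a placeholder).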

That's actually a neat little analysis platform, thanks!

Amusingly, I can run my application, and if I generate custom-formatted .json and write it to a file, I can bulk ingest it... which is pretty much what I do now without the fancy visualization app. I think this speaks to my point that the OpenTelemetry part of the pipeline wouldn't be doing much of anything in this case. (The reason I care about files is that applications run in places where internet connectivity is intermittent, so generating and exporting telemetry from an application/process needs to be independent from the task of transferring the collected data to another host.)
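
To make the "generation is independent from transfer" point concrete, here is a minimal sketch of a separate shipper that forwards whatever has been appended to a JSON Lines file since its last successful run. Everything here is an assumption for illustration: the function name `ship_new_lines`, the sidecar offset file, and the `send` callable (which stands in for whatever bulk-ingest call your backend offers and is expected to raise on failure). It does not handle file rotation.

```python
import json
import os


def ship_new_lines(data_path, offset_path, send):
    """Send lines appended since the last successful call; return how many were sent.

    `send` is a caller-supplied function that takes a list of dicts
    and raises an exception if the transfer fails.
    """
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path, encoding="utf-8") as f:
            offset = int(f.read().strip() or 0)

    with open(data_path, "rb") as f:
        f.seek(offset)
        chunk = f.read()

    # Only ship complete lines; a partially written last line waits for the next run.
    end = chunk.rfind(b"\n") + 1
    if end == 0:
        return 0
    batch = [json.loads(line) for line in chunk[:end].splitlines() if line.strip()]

    send(batch)  # if this raises (e.g. no connectivity), the offset is not advanced

    with open(offset_path, "w", encoding="utf-8") as f:
        f.write(str(offset + end))
    return len(batch)
```

The application only ever appends to the file; the shipper can run from cron or a timer whenever a connection happens to be available, and a failed run simply retries the same lines next time.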

For that use-case, you almost want the file to be rotated daily and just ... never sent ... at least until a customer has an issue, or you're investigating that hardware.
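
A rough sketch of the rotate-daily-and-keep-locally idea, using Python's standard `logging.handlers.TimedRotatingFileHandler`. The function names `make_telemetry_logger` and `record_event`, the `{"event": ...}` record shape, and the 14-day retention are assumptions picked for the example, not anything OpenTelemetry prescribes.

```python
import json
import logging
from logging.handlers import TimedRotatingFileHandler


def make_telemetry_logger(path, keep_days=14, name="telemetry"):
    """Return a logger that writes one JSON object per line to `path`,
    rotating at midnight and keeping `keep_days` old files."""
    handler = TimedRotatingFileHandler(
        path, when="midnight", backupCount=keep_days, encoding="utf-8"
    )
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep telemetry out of the app's normal log output
    logger.addHandler(handler)
    return logger


def record_event(logger, name, **fields):
    """Write a single telemetry event as a JSON line."""
    logger.info(json.dumps({"event": name, **fields}))
```

Old files age out on their own, and nothing leaves the machine until someone copies a given day's file off for investigation.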

Maybe part of the issue is that all the vendors working on it usually have time limits for ingesting data into their backends (like timestamps must be no more than -18/+2h from submission time), so they don't really care about it.

The major tracing library in Rust suggests a consumer that prints to stdout, but it's at the end of the introductory documentation: https://docs.rs/tracing/latest/tracing/

EDIT: it's what I've used when bridging between "this is a CLI app for maybe 3 people" and "this will need to be monitored"

Time and again I ran into two or three examples in different docs, with search engines sending me to the nonfunctional or ambiguous ones. Then I’d complain about it and someone would send me to a whole other doc I’d never seen that is 3 clicks away from the overview doc, while the broken ones are 0-2 clicks away.

It’s all moving too fast and yet not fast enough.