by wdb 733 days ago
Personally, I like OpenTelemetry; it's a nice standardised approach. I just wish vendors had better support for the semantic conventions defined for a wide variety of traces.

I quite like the idea of only needing to change one small piece of code to switch OTel exporters instead of swapping out a vendor trace SDK.

My main gripe with OpenTelemetry is that I don't fully understand what the exact difference is between (trace) events and log records.


> My main gripe with OpenTelemetry is that I don't fully understand what the exact difference is between (trace) events and log records.

This is my main gripe too. I don't understand why {traces, logs, metrics} are not just different abstractions built on top of "events" (blobs of data your application ships off to some set of central locations). I don't understand why the opentelemetry collector forces me to re-implement the same settings for all of them and import separate libraries that all seem to do the same thing by default. Besides sdks and processors, I don't understand the need for these abstractions to persist throughout the pipeline. I'm running one collector, so why do I need to specify where my collector endpoint is 3 different times? Why do I need to specify that I want my blobs batched 3 different times? What's the point of having opentelemetry be one project at all?

My guess is this is just because opentelemetry started as a tracing project, and then became a logs and metrics project later. If it had started as a logging project, things would probably make more sense.

> This is my main gripe too. I don't understand why {traces, logs, metrics} are not just different abstractions built on top of "events" (blobs of data your application ships off to some set of central locations).

By design, they cannot be abstractions of a single concept. For example, logs have a hard requirement on preserving sequential order and session and emitting strings, whereas metrics are aggregated and sampled and dropped arbitrarily and consist of single discrete values. Logs can store open-ended data, and thus need to comply with tighter data protection regulations. Traces often track a very specific set of generic events, whereas there are whole classes of metrics that serve entirely different purposes.

Just because you can squint hard enough to only see events being emitted, that does not mean all event types can or should be treated the same.

> Just because you can squint hard enough to only see events being emitted

If you squint hard enough you can fool yourself into thinking all metrics have the same availability requirements. It’s not the case. There are plenty of time series data metrics where arbitrarily dropping them or aggregating them would throw off your alerting entirely.

Indeed one would have to squint to the point of blindness.

Logs are single point in time, flat, linear sequence, never dropped (at best you'd collapse sequences of identical, repeated logs). Think dmesg, syslog, systemd journald/journalctl.

Metrics are statistical numeric data, which can be series, average, histogram, bucket... aggregation/reduction can be done on the fly/before leaving the observed thing. Some can be dropped, but it is important that dropping anything stays statistically meaningful.

Spans are a duration in time representing some operation, with metadata (numeric, stringy, structured even) attached pertaining to that operation. Spans have a parent, forming a tree, which forms a trace. Spans can be deduped and/or sampled, with specific occurrences forcefully kept (e.g. a 500 error) or dropped (e.g. a healthcheck).

They are fundamentally different (technical) primitives a.k.a (functional) tools to observe different things and serve different goals.
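To make the "sampled, with specific occurrences forcefully kept or dropped" part concrete, here is a rough sketch of what such a decision could look like. It is not any SDK's actual sampler: `Span`, `keep_span`, the `/healthz` route and the 10% default rate are all made up for illustration. Only the attribute names `http.route` and `http.response.status_code` follow the HTTP semantic conventions. Because it looks at the status code, it assumes the decision is made after the span has finished.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical, minimal stand-in for a finished span."""
    trace_id: str
    name: str
    attributes: dict = field(default_factory=dict)


def keep_span(span: Span, sample_rate: float = 0.1) -> bool:
    """Decide whether a finished span is kept (illustrative rules only)."""
    # Force-drop noise such as health checks (the route is an assumption).
    if span.attributes.get("http.route") == "/healthz":
        return False
    # Force-keep server errors regardless of the sample rate.
    if span.attributes.get("http.response.status_code", 0) >= 500:
        return True
    # Otherwise sample on a hash of the trace ID, so every span of the
    # same trace gets the same decision on every run.
    digest = hashlib.sha256(span.trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Note that the drop rule is checked first here, so a failing health check is still dropped; swapping the two checks would be an equally defensible choice.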

Right, the point I’m making is logs, metrics, traces, these concepts are views of data, with a pretty hazy relationship to the shape of the data itself or the handling requirements. Any assumption you make about them as a category (logs are unstructured, traces are sampled, metrics can be aggregated) is wrong nearly as much as it’s right.

> Right, the point I’m making is logs, metrics, traces, these concepts are views of data (...)

Not really. Logs are fundamentally different than operational metrics, which are fundamentally different than business/behavioral metrics, which are fundamentally different than traces, etc etc etc.

This is not a matter of "view". It's the result of completely different system requirements. They are emitted differently, they are processed/aggregated differently, they are stored differently, they are consumed differently.

Even within business metrics types, which is already a specialized type of metrics, you have fundamentally different system requirements. Click stream metrics mix traits of tracing with logging and metrics, and have very specific requirements regarding data protection.

They are all distinct observability features. They are not the same. At all. This is not up for debate.

> If you squint hard enough you can fool yourself into thinking all metrics have the same availability requirements.

I'm sorry, I have no idea what point you tried to make.

Something I mention any time I'm introducing OpenTelemetry is that it's an unfinished project, a huge piece being the unifying abstractions between those signals.

In part this is a very practical decision: most people already have pretty good tools for their logs, and have struggled to get tracing working. So it's better to work on tools for measuring and sending traces, and just let people export their current log stream via the OpenTelemetry collector.

Notably the OTel docs acknowledge this mismatch between current implementation and design goals: https://opentelemetry.io/docs/specs/otel/logs/#limitations-o...

If you're using OTLP, SDKs only require you to specify the endpoint once; the signal-specific settings are for when you want to send signals to different places.
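As a sketch of how that resolution works for OTLP over HTTP, as I understand the exporter spec: a signal-specific variable such as `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` is used as-is, otherwise the shared `OTEL_EXPORTER_OTLP_ENDPOINT` gets `v1/traces`, `v1/metrics` or `v1/logs` appended. The function name `resolve_otlp_http_endpoint` and the collector hostnames are invented here; this models the rules and is not code from any SDK.

```python
import os

# Per-signal relative paths used when only the shared endpoint is set.
_SIGNAL_PATHS = {"traces": "v1/traces", "metrics": "v1/metrics", "logs": "v1/logs"}


def resolve_otlp_http_endpoint(signal: str, env=None) -> str:
    """Return the URL a given signal would be sent to (hypothetical helper)."""
    env = os.environ if env is None else env
    # A signal-specific setting wins and is used exactly as written.
    specific = env.get(f"OTEL_EXPORTER_OTLP_{signal.upper()}_ENDPOINT")
    if specific:
        return specific
    # Otherwise fall back to the shared endpoint plus the signal path.
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")
    return base.rstrip("/") + "/" + _SIGNAL_PATHS[signal]


# One shared setting covers all three signals; only logs are redirected.
env = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector:4318",
    "OTEL_EXPORTER_OTLP_LOGS_ENDPOINT": "http://logs.example:4318/custom",
}
urls = {s: resolve_otlp_http_endpoint(s, env) for s in _SIGNAL_PATHS}
```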

The way you process/modify metrics vs logs vs traces is usually sufficiently different that there's not much point in having a unified event model if you're going to need a bunch of conditions to separate and process them differently. Of course, you can still use only one source (logs or events) and derive the other 2 from that, though that rarely scales well.
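For what "derive the others from one source" can look like, here is a toy sketch that turns raw request events into a counter and a latency summary. The event shape (`route`, `status`, `duration_ms`) and the function name `derive_metrics` are assumptions for illustration, not part of any OTel API.

```python
from collections import Counter, defaultdict


def derive_metrics(events):
    """Aggregate raw events into request counts and per-route latency stats."""
    request_count = Counter()
    durations = defaultdict(list)
    for event in events:
        # Counter keyed by (route, status), like a labelled metric.
        request_count[(event["route"], event["status"])] += 1
        durations[event["route"]].append(event["duration_ms"])
    latency = {
        route: {
            "count": len(values),
            "max_ms": max(values),
            "avg_ms": sum(values) / len(values),
        }
        for route, values in durations.items()
    }
    return request_count, latency


events = [
    {"route": "/a", "status": 200, "duration_ms": 10},
    {"route": "/a", "status": 200, "duration_ms": 30},
    {"route": "/a", "status": 500, "duration_ms": 50},
]
counts, latency = derive_metrics(events)
```

This version needs every raw event to be shipped and held before any number comes out, which hints at why the approach tends to get expensive at volume compared with aggregating in-process.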

Plus, the backends that you can use to store/visualize the data usually are optimized for specific signals anyways.

Well, only when you use the OTLP protocol and otel-collector. In other cases you would need a (span) exporter to multiple targets at the same time. But yeah, otel-collector would be the best approach to achieve this.

It's a bit confusing but here's my best attempt to explain it:

- Trace events (span events) are intended to be structured events and can optionally have semantic attributes behind them, similar to how spans have semantic attributes. They're great if your whole organization is bought in on tracing. They colocate your span events with their parent span. In practice they have poor searchability/indexing in many tools, so use them only when you expect to discover the span first. (Ex. debug info that is only useful for figuring out why a span was very slow, and that you're okay with not being easily searchable.)

- Log records are plain old logs, they should be structured, but don't have to be, and there isn't a high expectation of structured data, much less semantic attributes. Logs can be easily adopted without buying into tracing.

- Events API, this is an experimental part of Otel, but is intended to be an API that emits logs with the expectation of semantic conventions (and therefore is also structured). Afaik end users are not the intended audience of this API.
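A rough way to picture the first two shapes, using plain dataclasses rather than the real SDK types (the class names mirror the concepts, but the fields are trimmed down and the IDs and attribute values are made up): a span event only exists inside its span, while a log record stands alone and only optionally points back at a trace.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpanEvent:
    name: str
    timestamp_ns: int
    attributes: dict = field(default_factory=dict)


@dataclass
class Span:
    trace_id: str
    span_id: str
    name: str
    events: list = field(default_factory=list)

    def add_event(self, name, timestamp_ns, attributes=None):
        # The event is stored on the span, so you find it via the span.
        self.events.append(SpanEvent(name, timestamp_ns, attributes or {}))


@dataclass
class LogRecord:
    timestamp_ns: int
    body: str
    attributes: dict = field(default_factory=dict)
    # Trace context is optional: logs work without any tracing at all.
    trace_id: Optional[str] = None
    span_id: Optional[str] = None


span = Span(trace_id="abc123", span_id="def456", name="GET /orders")
span.add_event("cache.miss", 1_000, {"cache.key": "orders:42"})

plain_log = LogRecord(2_000, "order lookup was slow")
correlated_log = LogRecord(
    2_000, "order lookup was slow",
    trace_id=span.trace_id, span_id=span.span_id,
)
```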

Many teams fall somewhere along the spectrum of logs vs tracing, which is why there are options to do things multiple ways. My personal take is that log records are going to continue to be more flexible than span events for end users, given the state of current tools.

Disclaimer: I help build hyperdx, we're oss, otel-based observability and we've made product decisions based on the above opinions.

Can you give an example of the missing semantic conventions?