Hacker News
by technimad 1235 days ago
Why sample otel spans and miss out on the important ones?
3 comments

You can sorta have your cake and eat it too.

Firstly, not all spans are interesting. When 99.99% of your traffic is just going to serve up an HTTP 200 within your acceptable latency threshold, you don't need every one of those. You probably do want to keep 100% of error spans, or those where the root has a duration beyond a configured threshold. There are tools that can sample that way (usually called tail-based sampling, since the decision is made after the trace finishes).
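To make that concrete, here's a rough sketch of the keep/drop decision in Python. It isn't any particular tool's API: the `Trace` type, the `keep_trace` function, the 500 ms threshold, and the 1% baseline rate are all made up for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical summary of a finished trace."""
    root_duration_ms: float
    has_error: bool


def keep_trace(trace, latency_threshold_ms=500.0, baseline_rate=0.01,
               rng=random.random):
    """Decide whether to keep a completed trace.

    Errors and slow roots are always kept; everything else is kept
    with probability `baseline_rate`. Threshold and rate are
    illustrative assumptions, not recommended values.
    """
    if trace.has_error:
        return True
    if trace.root_duration_ms > latency_threshold_ms:
        return True
    # Boring, fast, successful trace: keep only a small random fraction.
    return rng() < baseline_rate


# Usage: an error is always kept, a slow root is always kept,
# and a fast success is kept only about 1% of the time.
print(keep_trace(Trace(root_duration_ms=12.0, has_error=True)))
print(keep_trace(Trace(root_duration_ms=900.0, has_error=False)))
```

The catch is that the decision needs the whole trace, so something has to buffer spans until the root finishes.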

Secondly, there are also ways to attach your effective sample rate as metadata to spans, and if you have a backend that supports re-weighting counts based on that, you can still get accurate all-up counts of overall traffic.
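The re-weighting itself is just arithmetic: a span kept at 1-in-N stands in for N original spans. A minimal sketch, assuming each kept span carries a hypothetical `sample_rate` field (the field name and the dict shape are my invention, not an OTel convention):

```python
from collections import Counter


def estimated_counts(spans):
    """Estimate original traffic per status from sampled spans.

    Each span is a dict with a 'status' and a 'sample_rate', where a
    sample_rate of N means "this span represents N original spans".
    Spans without a rate are assumed to be unsampled (rate 1).
    """
    totals = Counter()
    for span in spans:
        totals[span["status"]] += span.get("sample_rate", 1)
    return totals


# Errors kept at 100% (rate 1), successes kept at 1-in-100.
kept = [
    {"status": "error", "sample_rate": 1},
    {"status": "error", "sample_rate": 1},
    {"status": "ok", "sample_rate": 100},
    {"status": "ok", "sample_rate": 100},
    {"status": "ok", "sample_rate": 100},
]

totals = estimated_counts(kept)
print(totals["error"])  # 2
print(totals["ok"])     # 300
```

Five stored spans give an estimate of 302 requests. The success count is only an estimate, of course, and it gets noisier the more aggressively you sample.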

Admittedly, OTel and many backends don't have the best story for this yet. But it's getting better.

While I would like to ingest every one, cost is a factor.

Even if we were self-hosting, there's a cost to ingesting and storing every single span.

And even if we were able to pay for ingesting 100%, it isn't practical to ingest 100% of everything. Our most common request type (heartbeat) generates a span payload that is a multiple of the size of the original request. We're using Elixir in production, and it can absorb a tremendous amount of traffic, saturating the entire CPU capacity of the hardware if we let it. The agents are not capable of keeping up.

Because of cost?