Hacker News
by fnordpiglet 1005 days ago
Tracing is poor at very long-lived traces and at stream processing, and most tracing implementations are too heavy to run in computationally bound tasks except at a very coarse level. Logging is nice in that it needs no context, adds little overhead, and is generally very cheap to compose and emit. If you include a transaction ID and log in a structured way, it gives you most of what tracing does without all the other baggage.
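
As a rough illustration of the structured, transaction-ID-annotated logging described above, here is a minimal sketch using only Python's standard `logging` module. The field names (`txn_id`, `phase`), the logger name, and the helper functions are assumptions made for the example, not anything the comment specifies.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # txn_id and phase are hypothetical fields passed via `extra=`.
            "txn_id": getattr(record, "txn_id", None),
            "phase": getattr(record, "phase", None),
        }
        return json.dumps(payload)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("pipeline")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def run_job(logger, txn_id):
    """Emit two log lines that share one transaction ID."""
    ctx = {"txn_id": txn_id}
    logger.info("job started", extra={**ctx, "phase": "extract"})
    logger.info("job finished", extra={**ctx, "phase": "load"})


buf = io.StringIO()
run_job(build_logger(buf), "txn-123")
print(buf.getvalue(), end="")
```

Every line carries the same `txn_id`, so a log platform can stitch the lines of one job back together without any in-process span bookkeeping.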

That said, for the spaces where tracing works well, it works unreasonably well.

3 comments

I think OpenTelemetry has solved the stream processing problem with span links [1]: each unit of work is treated as an individual trace, but you can still combine them and see the causal relationship. Slack published a blog post about it pretty recently [2].

[1] https://opentelemetry.io/docs/concepts/signals/traces/#span-...

[2] https://slack.engineering/tracing-notifications/
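
To make the span-link idea concrete, here is a toy model in plain Python. It is deliberately not the OpenTelemetry API: the `SpanContext`, `Span`, and `start_trace` names are hypothetical and only show the shape of the data, where each unit of work gets its own trace and carries a link back to the span that caused it.

```python
import secrets
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str


@dataclass
class Span:
    name: str
    context: SpanContext
    # Contexts of causally related spans that live in other traces.
    links: list = field(default_factory=list)


def start_trace(name, links=()):
    """Start a root span in a fresh trace, optionally linked to other spans."""
    ctx = SpanContext(trace_id=secrets.token_hex(16), span_id=secrets.token_hex(8))
    return Span(name=name, context=ctx, links=list(links))


# Producer side: one trace for the publish; its context travels with the message.
publish = start_trace("publish notification")
message = {"body": "hello", "origin": publish.context}

# Consumer side: a separate trace per unit of work, linked back to the origin.
consume = start_trace("deliver notification", links=[message["origin"]])
```

The two spans have different trace IDs, so neither trace grows without bound, while the link still lets a backend walk from the consumer's trace to the producer's.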

When I worked at ScoutAPM, that list was basically the exact set of areas we had trouble supporting. We didn't do full-on tracing in the OpenTracing kind of way, but the agent was pretty similar, with spans (mostly automatically inserted) and annotations on those spans with timing, parentage, and extra info (like the SQL query a span represented in ActiveRecord).

The really hard things, which we had reasonable answers for, but never quite perfect:
* Rails websockets (ActionCable)
* Very long-running background jobs (we stopped collecting at some limit, to prevent unbounded memory)
* Profiling code: we used a modified version of Stackprof to do sampling instead of exact profiling. That worked surprisingly well at finding hotspots, with low overhead.
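
The "stop collecting at some limit" approach for long-running jobs could look roughly like the sketch below. This is an assumption-laden illustration in Python, not code from the Scout agent: the `BoundedRecorder` class and the `max_spans` value are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BoundedRecorder:
    """Collects spans for one job but never holds more than max_spans."""

    max_spans: int = 1000  # hypothetical cap, not Scout's actual limit
    spans: list = field(default_factory=list)
    dropped: int = 0

    def record(self, name, duration_ms):
        """Store a span if there is room; otherwise just count the drop."""
        if len(self.spans) >= self.max_spans:
            self.dropped += 1
            return False
        self.spans.append((name, duration_ms))
        return True


rec = BoundedRecorder(max_spans=3)
for i in range(5):
    rec.record(f"query-{i}", 1.5)
print(len(rec.spans), rec.dropped)
```

Keeping a `dropped` counter means the report can at least say that the job's picture is incomplete, instead of silently truncating it.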

All sorts of other tricks came along too. I should go look at that codebase again to remind myself. That'd be good for my resume.... :)

https://github.com/scoutapp/scout_apm_ruby

Hmmm, for long-lived processes and stream processing we use tracing just fine. What we do is set a cutoff of 60 seconds, so that each chunk is its own trace. But our backend queries trace data directly, so we can still analyze the aggregate, long-term behavior and then dig into a particular 60-second chunk if it's problematic.
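
One way to picture the 60-second cutoff is to derive the trace ID from the stream and the time window, as in this Python sketch. The comment does not say how trace IDs are actually assigned, so `chunk_trace_id`, `group_into_traces`, and the ID format are assumptions for illustration only.

```python
CHUNK_SECONDS = 60  # cutoff mentioned in the comment


def chunk_trace_id(stream_id, timestamp):
    """Map a timestamp (in seconds) to the trace ID of its 60-second window."""
    window = int(timestamp // CHUNK_SECONDS)
    return f"{stream_id}:{window}"


def group_into_traces(stream_id, events):
    """Group (timestamp_seconds, name) events into one trace per window."""
    traces = {}
    for ts, name in events:
        traces.setdefault(chunk_trace_id(stream_id, ts), []).append((ts, name))
    return traces


events = [(0.0, "poll"), (59.9, "flush"), (60.0, "poll"), (125.0, "flush")]
traces = group_into_traces("consumer-7", events)
```

Here the first two events share a trace, and the events at 60.0 and 125.0 seconds each fall into their own chunk, so no single trace ever covers more than a minute of work.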

So, here are a few examples:

Suppose you have a long data pipeline that you want to trace jobs across. There are not an enormous number of jobs, but each one takes 12 hours across many phases. In theory tracing works great here, but in practice most tracing platforms can’t handle this. This is especially true with tail-based sampling, since traces can be unbounded and the platform has to assume at some point that they have timed out. You can certainly build your own, but most of the value of tracing solutions is the user experience, which is also the hardest part.

On stream processing, I’ve generally found it too expensive to instrument stream processors with tracing. Also, there’s generally not enough variability to make it interesting. Context stitching and span management, as well as sweeping and shipping of traces, can be expensive in a lot of implementations, and stream processing is often CPU bound.

A simple log annotated with a transaction ID, queried in a log analytics platform, makes a lot more sense in both cases.
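
For the 12-hour pipeline case, the kind of query a log analytics platform would run can be sketched in a few lines of Python. The JSON field names (`txn_id`, `phase`, `event`, `ts`) and the `phase_durations` helper are hypothetical; a real platform would express the same grouping in its own query language.

```python
import json
from collections import defaultdict


def phase_durations(log_lines):
    """Return {txn_id: {phase: seconds}} from JSON lines with start/end events."""
    starts = {}
    durations = defaultdict(dict)
    for line in log_lines:
        rec = json.loads(line)
        key = (rec["txn_id"], rec["phase"])
        if rec["event"] == "start":
            starts[key] = rec["ts"]
        elif rec["event"] == "end" and key in starts:
            # Duration is the gap between the matching start and end lines.
            durations[rec["txn_id"]][rec["phase"]] = rec["ts"] - starts.pop(key)
    return dict(durations)


logs = [
    '{"txn_id": "job-1", "phase": "extract", "event": "start", "ts": 0}',
    '{"txn_id": "job-1", "phase": "extract", "event": "end", "ts": 3600}',
    '{"txn_id": "job-1", "phase": "transform", "event": "start", "ts": 3600}',
    '{"txn_id": "job-1", "phase": "transform", "event": "end", "ts": 43200}',
]
result = phase_durations(logs)
```

Because each log line is independent, nothing has to stay open in memory for the 12 hours the job runs; the stitching happens at query time.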