by xinweihe 315 days ago
Good question! Your setup already covers a lot — but TraceRoot tries to go a bit further in a few areas:

In TraceRoot, we organize all logs, metrics, etc. around traces and build an execution tree. This structured view makes it much easier for our agent to reason through large amounts of telemetry data using context-aware optimizations. (We plan to support Slack and Notion integrations as well.)
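To make the "execution tree" idea concrete, here is a minimal sketch of grouping logs under spans and linking spans by parent. It is not TraceRoot's actual code: the field names (`span_id`, `parent_id`, `name`, `message`) and the `Node` and `build_tree` names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One span in the tree, with the logs emitted inside it."""
    span_id: str
    name: str
    logs: list = field(default_factory=list)
    children: list = field(default_factory=list)


def build_tree(spans, logs):
    """Return the root nodes of a span tree with logs attached.

    `spans` and `logs` are plain dicts (a hypothetical shape, not a real SDK type).
    """
    nodes = {s["span_id"]: Node(s["span_id"], s["name"]) for s in spans}

    # Attach each log to its span; logs that reference unknown spans are skipped.
    for log in logs:
        node = nodes.get(log["span_id"])
        if node is not None:
            node.logs.append(log["message"])

    # Link children to parents; spans with no known parent become roots.
    roots = []
    for s in spans:
        parent = nodes.get(s.get("parent_id"))
        if parent is None:
            roots.append(nodes[s["span_id"]])
        else:
            parent.children.append(nodes[s["span_id"]])
    return roots
```

With a tree like this, an agent (or a person) can walk down from a failing request to the exact child span and see only the logs emitted there.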

It’s not a one-off tool. TraceRoot is a live monitoring platform. It continuously watches what’s happening in prod. So when something breaks, the agent already has full team-visible context, not just a scratchpad session spun up in the moment.

Down the line, we're aiming for automatic bug detection and remediation - not just smarter copiloting, but proactive debugging workflows. The system also retains team-level memory of past bugs, fixes, and infra quirks, so the agent gets smarter over time.

We’ve open sourced a lot of the core. Would love to jam on this if you're up for it. Always down to trade ideas or even hack on something together!


I don't understand: OTel already does that unification, with traces connected to logs, etc. I'm still missing something...

Thanks for the follow-up. Let me try to clarify!

When we say we "organize all logs, metrics, and traces", we mean more than just linking them together (which OTel already supports). What we’re doing is:

- Context engineering optimization: We leverage the structure among logs, spans, and metadata to filter and group relevant context before passing it to the LLM. In real production issues, it's common to see 10k+ logs, traces, etc. related to a single incident, but most of it is noise. Throwing all of that at agents usually leads to poor performance due to context bloat (see https://arxiv.org/pdf/2307.03172). We're working on addressing that through structured filtering and summarization. For more details, see https://bit.ly/45Bai1q.

- Human-in-the-Loop UI: For cases where developers want to manually inspect or guide the agent, we provide a UI that makes it easy to zoom in on relevant subtrees, trace paths, or log clusters and directly select spans to be included in the reasoning of agents.
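As a rough illustration of the filtering and grouping described in the first bullet, the sketch below drops low-severity logs and collapses near-duplicate messages into counted templates before anything reaches the LLM. This is one simple way to do it under assumed log fields (`span_id`, `level`, `message`); the `normalize` and `compress_logs` helpers are hypothetical and not taken from TraceRoot.

```python
import re
from collections import Counter


def normalize(message):
    """Replace digit runs with a placeholder so similar messages group together."""
    return re.sub(r"\d+", "<n>", message)


def compress_logs(logs, keep_levels=("ERROR", "WARN"), max_lines=50):
    """Filter logs by level, group them by (span, level, template), and
    return at most `max_lines` summary lines, most frequent first."""
    counts = Counter()
    for log in logs:
        if log["level"] in keep_levels:
            key = (log["span_id"], log["level"], normalize(log["message"]))
            counts[key] += 1

    return [
        f"[{level}] span={span_id} x{n}: {template}"
        for (span_id, level, template), n in counts.most_common(max_lines)
    ]
```

Thousands of repeated timeout lines then shrink to a single line with a count, which keeps the prompt small without hiding how often the error occurred.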

The goal isn't just unification, it's scalable reasoning over noisy telemetry data, both automated and interactive.

Hope that clears things up a bit! Happy to dive deeper if useful.

The second link helps

It's interesting to wonder whether 80% of the question answering could be achieved with a prompts/otel.md over MCPs connected to Claude Code, letting agentic reasoning do the rest

Ex:

* When investigating errors, only query for error-level logs

* When investigating performance, only query spans (skip logs unless required) and keep only name, time. Linearize as ... .

* When querying both logs & traces, inline logs near the relevant trace as part of an LLM-friendly stored artifact jobs/abc123/context.txt
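For what it's worth, the three rules above could be sketched as one small rendering step. The line format here is my own assumption (the linearization format in the second rule is left unspecified), as are the field names and the `render_context` helper.

```python
def render_context(spans, logs, error_only=True):
    """Render spans in start-time order, keeping only name and timing,
    with each span's logs inlined directly beneath it."""
    by_span = {}
    for log in logs:
        # Rule 1: when investigating errors, keep only error-level logs.
        if error_only and log["level"] != "ERROR":
            continue
        by_span.setdefault(log["span_id"], []).append(log)

    lines = []
    for span in sorted(spans, key=lambda s: s["start_ms"]):
        # Rule 2: one compact line per span (name and time only).
        lines.append(
            f'{span["name"]} start={span["start_ms"]}ms dur={span["duration_ms"]}ms'
        )
        # Rule 3: logs sit next to the span that produced them.
        for log in by_span.get(span["span_id"], []):
            lines.append(f'    {log["level"]}: {log["message"]}')
    return "\n".join(lines)
```

The returned string is what would be saved as the stored artifact (for example the context.txt mentioned above) and handed to the agent.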

Are there aspects of the question answering (not ui widgets) you think are too hard there?

Yes, we can connect, for example, CC with MCPs. But this may not work well if, say, a user wants to check the latency for the previous 10 days of error logs on function A. Using MCP, the agent first needs to fetch 10 days of error logs, then somehow get the latency, correlate the two, and apply filters for function A. IMO it will hallucinate a lot if there are too many tools, logs, and traces. On the TraceRoot platform, by contrast, we "mix" all the necessary data up front and then apply filters on the structured data based on the user's query, which is more accurate, straightforward, and efficient. Here is the README of the general design: https://github.com/traceroot-ai/traceroot/tree/main/rest/age...
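A toy version of the "latency for the previous 10 days of error logs on function A" query shows why pre-joined data helps: once each record already carries the function name, log level, timestamp, and span duration, the question is a single filter. The record shape and the `error_latencies` helper are assumptions for illustration, not the actual TraceRoot schema.

```python
from datetime import datetime, timedelta, timezone


def error_latencies(records, function, now, days=10):
    """Return span durations (ms) for ERROR records of `function`
    whose timestamp falls within the last `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return [
        r["duration_ms"]
        for r in records
        if r["function"] == function
        and r["level"] == "ERROR"
        and r["timestamp"] >= cutoff
    ]


# Example: records where logs and span timings were joined ahead of time.
now = datetime(2025, 1, 20, tzinfo=timezone.utc)
records = [
    {"function": "A", "level": "ERROR", "duration_ms": 120,
     "timestamp": datetime(2025, 1, 15, tzinfo=timezone.utc)},
    {"function": "A", "level": "ERROR", "duration_ms": 300,
     "timestamp": datetime(2025, 1, 1, tzinfo=timezone.utc)},   # older than 10 days
    {"function": "B", "level": "ERROR", "duration_ms": 80,
     "timestamp": datetime(2025, 1, 15, tzinfo=timezone.utc)},  # different function
]
latencies = error_latencies(records, "A", now)
```

No cross-tool correlation step is needed here, which is the part most likely to go wrong when an agent has to stitch together results from separate log and trace tools.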