Hacker News
by chirdeeps 99 days ago
OpenTelemetry and standard observability stacks are great for seeing the latency and token counts of individual LLM calls, but they break down when you try to debug the coordination between agents.

The hardest failure mode we've had to debug isn't a single agent hallucinating; it's Agent A correctly doing its job but passing slightly malformed state to Agent B, which then confidently executes a destructive action based on that bad state. By the time you see the error, the root cause is three steps up the chain.

Tracing doesn't solve this because it just shows you the execution path, not the authority boundary. What you actually need is a way to enforce contracts between agents: an execution layer that says "Agent B cannot accept this payload from Agent A unless it meets X criteria, and if it fails, roll back Agent A's last action." Until we treat multi-agent systems as concurrent state machines rather than just chained API calls, debugging them is going to remain a nightmare.
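To make the contract idea concrete, here is a rough sketch of what such a boundary check could look like. Everything in it is hypothetical and made up for illustration (`Handoff`, `check_contract`, `guarded_handoff`, and the `record_id` / `action` fields); it is not any framework's API, and it assumes Agent A can supply an undo callback for its last step.

```python
from dataclasses import dataclass
from typing import Any, Callable


class ContractViolation(Exception):
    """Raised when a payload fails the contract at an agent boundary."""


@dataclass
class Handoff:
    """What Agent A hands to Agent B (hypothetical shape)."""
    payload: dict[str, Any]
    undo: Callable[[], Any]  # compensating action for Agent A's last step


def check_contract(payload: dict[str, Any]) -> list[str]:
    """Example contract: the 'X criteria' from the post, invented here."""
    errors = []
    if not isinstance(payload.get("record_id"), int):
        errors.append("record_id must be an int")
    if payload.get("action") not in {"archive", "delete"}:
        errors.append("action must be 'archive' or 'delete'")
    return errors


def guarded_handoff(
    handoff: Handoff,
    contract: Callable[[dict[str, Any]], list[str]],
    receiver: Callable[[dict[str, Any]], Any],
) -> Any:
    """Run the receiver only if the payload satisfies the contract.

    On violation, roll back the sender's last action and fail loudly at the
    boundary instead of letting the receiver act on bad state.
    """
    errors = contract(handoff.payload)
    if errors:
        handoff.undo()
        raise ContractViolation("; ".join(errors))
    return receiver(handoff.payload)
```

The point of the design is that the failure surfaces at the transition where the bad state crossed over, not three steps later. A real system would also need durable undo (compensating transactions) rather than an in-process callback.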
2 comments

The “authority boundary” framing is really helpful — tracing explains what happened, but not whether a transition between agents should have been allowed.

Curious how teams are handling this today — are those contracts usually defined explicitly (schemas / validators), or are they mostly implicit in the agent code and discovered only after failures?

If you can't trace across agents (the way you would across services), then you haven't set up OTEL completely.

What your hard fail does is at a different layer of control, separate from OP's question about just seeing what happened so you can design those control systems. That layer is more guards, validators, and the like (more subagents).

I stay more human in the loop because these things are not ready for prime time the way you describe using them. That's burning tokens on average imo.

That makes sense — sounds like a lot of this is handled at the framework + design level in your setup.

In practice, when something does go wrong in a multi-step workflow, do you typically rely on tracing + manual debugging, or do you have built-in mechanisms for partial replay / recovery?