|
|
|
|
|
by chirdeeps
99 days ago
|
|
OpenTelemetry and standard observability stacks are great for seeing the latency and token counts of individual LLM calls, but they break down when you try to debug the coordination between agents.The hardest failure mode we've had to debug isn't a single agent hallucinating; it's Agent A correctly doing its job, but passing slightly malformed state to Agent B, which then confidently executes a destructive action based on that bad state. By the time you see the error, the root cause is three steps up the chain.Tracing doesn't solve this because it just shows you the execution path, not the authority boundary. What you actually need is a way to enforce contracts between agents—an execution layer that says "Agent B cannot accept this payload from Agent A unless it meets X criteria, and if it fails, rollback Agent A's last action." Until we treat multi-agent systems as concurrent state machines rather than just chained API calls, debugging them is going to remain a nightmare. |
|
Curious how teams are handling this today — are those contracts usually defined explicitly (schemas / validators), or are they mostly implicit in the agent code and discovered only after failures?