by anupsing_ai
129 days ago
Been building multi-agent systems for the past few months and keep running into the same problem: agent regressions that no traditional test catches.
Wrote up the failure modes I've actually seen. The short version:
Skill bloat is the sneaky one. You add tools over time, each makes sense individually, then at 30+ tools the agent starts picking the wrong ones. Anthropic's team wrote about this too. A single Playwright MCP server burns 11.7K tokens on tool definitions alone, and those definitions are sent with every request whether or not the tools get used.
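A rough sketch of what a tool-budget check could look like in CI. Everything here is an assumption for illustration: the function names, the thresholds, and especially the 4-characters-per-token estimate, which is a crude heuristic and not a real tokenizer. Swap in your provider's token counter if you want accurate numbers.

```python
import json

# Crude heuristic (assumption): roughly 4 characters per token.
# This is NOT a real tokenizer; use your provider's counter for accuracy.
CHARS_PER_TOKEN = 4


def estimate_tool_tokens(tools):
    """Approximate the tokens spent on tool definitions in each request."""
    serialized = json.dumps(tools, separators=(",", ":"))
    return len(serialized) // CHARS_PER_TOKEN


def check_tool_budget(tools, max_tools=30, max_tokens=8000):
    """Return a list of budget violations; an empty list means the check passes.

    Both limits are hypothetical defaults, not recommendations.
    """
    problems = []
    if len(tools) > max_tools:
        problems.append(f"{len(tools)} tools registered, limit is {max_tools}")
    tokens = estimate_tool_tokens(tools)
    if tokens > max_tokens:
        problems.append(f"~{tokens} tokens of tool definitions, limit is {max_tokens}")
    return problems
```

Failing the build when `check_tool_budget` returns anything at least forces a conversation each time someone adds a tool, instead of discovering the bloat after the agent starts misrouting.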
Context rot caught me off guard. Chroma measured 18 LLMs and found 11/12 drop below 50% performance at just 32K tokens. We had a bug where adding a new capability pushed an older instruction into the "lost in the middle" zone. Took two days to figure out why an unrelated feature degraded.
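One cheap guard against that kind of bug is to track where critical instructions land in the assembled prompt. This is a minimal sketch, assuming you can get the final prompt as a string; the 25%-75% "middle zone" band and the function names are my own illustrative choices, not something taken from the research.

```python
def relative_position(prompt, instruction):
    """Return where the instruction's midpoint falls in the prompt (0.0 to 1.0).

    Returns None if the instruction does not appear in the prompt at all.
    """
    index = prompt.find(instruction)
    if index == -1:
        return None
    midpoint = index + len(instruction) / 2
    return midpoint / len(prompt)


def in_middle_zone(prompt, instruction, low=0.25, high=0.75):
    """True if the instruction sits in the (assumed) low-attention middle band."""
    position = relative_position(prompt, instruction)
    return position is not None and low < position < high
```

Run it over a list of must-follow instructions on every prompt change, and flag any that moved into the band since the last build.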
The scariest one is when a prompt tweak makes your agent stop saying "I don't know" and start confidently making things up. Surface metrics go up. Accuracy goes down. Google's AI Overviews had this problem at scale.
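A simple way to catch this is to keep a set of deliberately unanswerable questions in your eval suite and compare how often the agent abstains before and after a change. The sketch below is an assumption-heavy illustration: the marker phrases, the 10-point drop threshold, and the function names are all hypothetical, and substring matching is a weak proxy for a proper grader.

```python
# Hypothetical phrases that count as an abstention; tune these for your agent.
ABSTAIN_MARKERS = ("i don't know", "i do not know", "not sure", "cannot answer")


def abstained(response):
    """True if the response contains any abstention marker (case-insensitive)."""
    text = response.lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def abstention_rate(responses):
    """Fraction of responses that abstain; 0.0 for an empty list."""
    if not responses:
        return 0.0
    return sum(abstained(r) for r in responses) / len(responses)


def abstention_regressed(baseline, candidate, max_drop=0.10):
    """True if the candidate abstains notably less often than the baseline.

    Both arguments are lists of responses to the same unanswerable questions.
    """
    return abstention_rate(baseline) - abstention_rate(candidate) > max_drop
```

Because every question in the set is unanswerable, a falling abstention rate means more made-up answers, even when the surface metrics look better.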
Also covers cost creep (same traffic, 40% higher bill, nobody has a cost gate in CI/CD), multi-agent cascade failures (Berkeley's MAST paper found 14 failure modes across 7 frameworks, ChatDev hit 33% correctness), and model updates breaking things you didn't touch.
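For the cost creep case, a gate can be as simple as replaying a fixed eval set and comparing total spend against the last known-good run. A minimal sketch, assuming you log (input_tokens, output_tokens) per request; the per-million-token prices and the 10% threshold are placeholders, not real pricing.

```python
def run_cost(usages, price_in_per_mtok, price_out_per_mtok):
    """Total dollar cost for a list of (input_tokens, output_tokens) pairs.

    Prices are dollars per million tokens and are supplied by the caller.
    """
    total = 0.0
    for input_tokens, output_tokens in usages:
        total += input_tokens / 1_000_000 * price_in_per_mtok
        total += output_tokens / 1_000_000 * price_out_per_mtok
    return total


def cost_gate(baseline_cost, candidate_cost, max_increase=0.10):
    """Return (passed, fractional_increase) for a candidate run vs. a baseline.

    max_increase=0.10 is a hypothetical default: fail on more than 10% growth.
    """
    if baseline_cost <= 0:
        raise ValueError("baseline cost must be positive")
    increase = (candidate_cost - baseline_cost) / baseline_cost
    return increase <= max_increase, increase
```

With the same traffic and a 40% higher bill, `cost_gate` returns `passed=False` and the build stops before the invoice does the alerting for you.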
We have 20 years of great tooling for deterministic regression. For probabilistic regression we have basically nothing.