| HN Mirror

The per-agent-green / merge-broken pattern is the diagonal failure mode of multi-agent systems. Unit testing each agent in isolation captures correctness within scope; what's invisible is the seam at handoff — argument schemas drifting between coder and overseer, response shapes that satisfy each agent's local validator but break the next's parser, error messages that get summarized into "no error" by the time they reach the orchestrator.

  Built tool-call-grader to instrument exactly this. Session-level statistics across the tool-call trace plus six pathology detectors (silent failure, tool fixation, response bloat, schema drift, irrelevant response, cascading failure). On a hand-designed multi-agent benchmark, 7/7 scenarios passed — including specifically the case you're describing:
  per-agent results look fine, schema-drift fires at the seam.  
  The detector runs over the trace, not the output. Catches the failure several turns before it shows up as "weird merge bug" the human has to debug. MIT licensed, npx-installable. Methodology in profile.