Hacker News new | ask | show | jobs
by thisisfatih 56 days ago
The refactor-per-cycle fix lands in the right place. The harder problem shows up when EvanFlow forks into parallel coder/overseer mode: unit tests pass per agent, but the seams break at merge. Your note that "integration tests at touchpoints ARE the cohesion contract" is exactly right, but enforcement is what makes it stick. Each parallel branch needs its own failing test that can't be masked by another branch's green run. Worktree isolation handles this cleanly since each agent's environment is separate. Without that, vertical-slice TDD in parallel collapses to "tests pass somewhere."

On jtfrench's unanswered question about dumb zone evasion: context length is what drives the drift. Agents go off-track when a loop runs long enough that early design context falls out. Resetting at each RED-GREEN-REFACTOR boundary keeps cycles short enough to avoid it. The hard cap of 5 iterate rounds is the same instinct applied at the macro level.

We ran into the parallel integration seam problem building tonone, a 23-agent Claude Code plugin where each domain agent works in its own worktree and integration tests are the merge contract.

https://github.com/tonone-ai/tonone if curious.

1 comments

The per-agent-green / merge-broken pattern is the diagonal failure mode of multi-agent systems. Unit testing each agent in isolation captures correctness within scope; what's invisible is the seam at handoff — argument schemas drifting between coder and overseer, response shapes that satisfy each agent's local validator but break the next's parser, error messages that get summarized into "no error" by the time they reach the orchestrator.

  Built tool-call-grader to instrument exactly this. Session-level statistics across the tool-call trace plus six pathology detectors (silent failure, tool fixation, response bloat, schema drift, irrelevant response, cascading failure). On a hand-designed multi-agent benchmark, 7/7 scenarios passed — including specifically the case you're describing:
  per-agent results look fine, schema-drift fires at the seam.  
  The detector runs over the trace, not the output. Catches the failure several turns before it shows up as "weird merge bug" the human has to debug. MIT licensed, npx-installable. Methodology in profile.