Hacker News
by Amber-chen 52 days ago
I like the small-surface-area approach. The test I’d apply is how well the harness records and replays tool calls and their failure modes, since that is where debugging agent behavior usually gets messy.