by esperent
5 days ago
From that paper:

> This raises a central question: do such tests meaningfully improve issue resolution, or do they mainly mimic a familiar software-development practice while consuming interaction budget?

This is an important question, but it's not the one I'm most interested in when requiring agents to follow TDD. My goal is to lock in behavior, because far too often an agent would successfully fix the issue at hand but break something else that it wasn't supposed to touch. The tests add another layer of protection, which is why I always separate out red and green worker subagents. The green worker might get trigger-happy and go beyond scope or break something, but it's not allowed to fudge the tests, so I'll know and can clean up and revert. It's also why I'm not too bothered about perfect red-green TDD: I can add the tests later if needed.
|
I've been finding that enforcing integrations and behavior structurally (e.g., through codegen/schemagen, e2e tests, etc.) is more reliable than simply instructing the models to write tests. Oftentimes these tests are pretty low quality anyway, and they result in their own form of tech debt.