by ai-christianson
443 days ago
> I would start by making the examples yourself initially

What I'm doing right now is this:

1) I have X problem to solve using the coding agent.
2) I ask the agent to do X.
3) I use my own brain: did the agent do it correctly?

If the agent did not do it correctly, I then ask: should the agent have been able to solve this? If so, I try to improve the agent so it's able to do that.

The hardest part about automating this is #3 above: each evaluation is one-off, and it would be hard to even formalize the evaluation. SWE-bench, for example, uses unit tests for this, and the agent is blind to the unit tests, so the agent has to make a red test (which it has never seen) go green.