by peterjliu
443 days ago
I would start by making the examples yourself, assuming you have a good sense of what the real-world task is. If you can't articulate what a good task is and what a good output looks like, it is not ready to be outsourced to crowd-workers. And before going to crowd-workers (maybe you can skip them entirely), try LLMs.
What I'm doing right now is this:
If the agent did not do it correctly, I then ask: should the agent have been able to solve this? If so, I try to improve the agent so that it can.

The hardest part about automating this is #3 above: each evaluation is one-off, and it would be hard to even formalize the evaluation.
SWE-bench, for example, uses unit tests for this, and the agent is blind to those tests, so the agent has to make a red test (which it has never seen) go green.
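
To make the red-to-green idea concrete, here is a minimal sketch of a held-out test check. It is not SWE-bench's actual harness; the names hidden_test, evaluate, buggy_slugify and patched_slugify are all made up for illustration, and a single function stands in for a whole repository. The agent would only ever see the task description, never hidden_test.

```python
import re
from typing import Callable, Dict

# Hypothetical stand-ins: a "solution" is just a function from str to str.
Solution = Callable[[str], str]


def hidden_test(slugify: Solution) -> bool:
    """Held-out check that the agent never sees while it works."""
    try:
        return slugify("Hello, World!") == "hello-world"
    except Exception:
        # A crash counts as a failing (red) test.
        return False


def evaluate(before: Solution, after: Solution) -> Dict[str, bool]:
    """The task counts as solved only if the test was red before and green after."""
    red_before = not hidden_test(before)
    green_after = hidden_test(after)
    return {
        "red_before": red_before,
        "green_after": green_after,
        "solved": red_before and green_after,
    }


def buggy_slugify(text: str) -> str:
    # Stand-in for the code before the agent's change: punctuation leaks through.
    return text.lower().replace(" ", "-")


def patched_slugify(text: str) -> str:
    # Stand-in for the agent's patch: collapse non-alphanumeric runs into "-".
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


result = evaluate(buggy_slugify, patched_slugify)
print(result)
```

Requiring the test to be red before the change matters: a test that already passes tells you nothing about whether the agent did the work.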