|
|
|
|
|
by tm365
30 days ago
|
|
Some, like TerminalBench-2.0, require web access for some tasks. If agents are expected to use the web productively as a tool, which is a very useful SWE skill, they should be evaluated with web access enabled. Otherwise the behavior you evaluate risks drifting from that of the agent you are actually shipping. |
|