Hacker News
by tm365 30 days ago
Some, like TerminalBench-2.0, require web access for some tasks.

If agents are expected to use the web productively as a tool, which is a very useful SWE skill, they should be evaluated in that setting. Otherwise you risk behavior drift between the agent you evaluate and the agent you are actually shipping.