by mgrund 30 days ago
I was under the impression that SWE-bench (and I guess most other benchmarks) were supposed to be run offline?

I get that you may accidentally include something in local git history, but it feels off to me to run these kinds of benchmarks online.


The article has this to say:

> Blocking web access outright would indeed prevent this, but isn’t possible as many benchmarks do require network access to download resources and hit relevant APIs to solve the task - the example above requires a video download from YouTube. Even if this weren’t the case, searching the web for context is a vital agent capability, so blocking it would stray from the downstream agent experience we wish to measure.

SWE-bench is a standardized evaluation suite, which is why I'm asking. Hopefully there are well-defined criteria for whether it is an open-book or closed-book benchmark.

As I understand it, it is designed to evaluate the LM itself, not agentic systems with online access, where the likelihood of unintentional cheating or solution leaking is very high. The paper and docs are not super clear on the concrete requirements (although reproducibility is emphasized, which argues against online access). So I was hoping someone with more familiarity would chime in.

Obviously not a problem for internal evaluations, but for fair scoreboard submissions it matters. It's not a matter of whether internet searches are useful, but rather what the benchmark is intended to measure.

Some benchmarks, like TerminalBench-2.0, require web access for some tasks.

If agents are expected to use the web productively as a tool, which is a very useful SWE skill, they should be evaluated in that setting. Otherwise you risk behavior drift from the agent you are actually shipping.