by mgrund
30 days ago
I was under the impression that SWE-bench (and I guess most other benchmarks) was supposed to be run offline? I get that you may accidentally include something in local git history, but it feels off to me to run these kinds of benchmarks online.
> Blocking web access outright would indeed prevent this, but isn’t possible as many benchmarks do require network access to download resources and hit relevant APIs to solve the task - the example above requires a video download from YouTube. Even if this weren’t the case, searching the web for context is a vital agent capability, so blocking it would stray from the downstream agent experience we wish to measure.