by mgrund 29 days ago
SWE-bench is a standardized evaluation suite, which is why I'm asking. Hopefully there are well-defined criteria for whether it is an open-book or closed-book benchmark.

As I understand it, it is designed to evaluate the LM itself, not agentic systems with online access, where there is a very high likelihood of unintentional cheating or solution leakage. The paper and docs are not super clear on the concrete requirements, although reproducibility is emphasized, which argues against online access. So I was hoping someone with more familiarity could chip in.

Obviously this is not a problem for internal evaluations, but it matters for fair scoreboard submissions. The question isn't whether internet searches are useful, but what the benchmark is intended to measure.