by mgrund
29 days ago
SWE-bench is a standardized evaluation suite, which is why I'm asking: I'd hope there are well-defined criteria for whether it is an open-book or closed-book benchmark. As I understand it, it is designed to evaluate the LM itself, not agentic systems with online access, where the likelihood of unintentional cheating or solution leakage is very high. The paper and docs are not very clear on the concrete requirements, although they emphasize reproducibility, which argues against online access. So I was hoping someone with more familiarity could chime in. This is obviously not a problem for internal evaluations, but it matters for fair scoreboard submissions. The question isn't whether internet searches are useful, but what the benchmark is intended to measure.