by criemen
324 days ago
Running evals isn't the problem; the problem is acquiring or building a high-quality, non-contaminated dataset. https://arxiv.org/abs/2506.12286 makes a very compelling case that SWE-bench (and by extension, anything that's based on public source code) is most likely overestimating your agent's actual capabilities.