by djoldman
60 days ago
> We have incorporated these findings into our recent evaluation efforts. In the last months we’ve chosen to report results from the public split of SWE-Bench Pro. We recommend other model developers do the same. SWE-bench Pro is not perfect, but empirically seems to suffer less from contamination issues. https://arxiv.org/pdf/2509.16941