by mbesto
270 days ago
> SWE-bench, and many other AI benchmarks, have lots of eval noise

SWE-bench has lots of known limitations even with its ability to reduce solution leakage and overfitting.

> where there is no clear right answer

This is both a feature and a bug. If there is no clear answer, then how do you determine whether an LLM has progressed? It can't simply be judged on making "more right answers" on each release.