by lnrd
62 days ago
I'm honestly confused by the design of SWE-bench and why it is considered reliable. It's based on existing GitHub PRs and issues, the full dataset is on HuggingFace, and it is a year old now. All frontier models 100% have those issues and PRs in their training data, so obviously they are good at reproducing fixes for them when confronted with the same codebase and similar requests. Am I missing something? How is this considered the most reliable benchmark?