Hacker News
by
XCSme
52 days ago
They mentioned on their release page that the Claude team noticed memorization of the SWE-bench test, so the test is actually in the training data.
Here:
https://www.anthropic.com/news/claude-opus-4-7#:~:text=memor...
2 comments
sigmoid10
52 days ago
Any static benchmark older than 12-18 months is basically worthless, because the content will have spread all over the internet and have found its way into the latest model's training set.
William_BB
52 days ago
Good luck arguing with SWE benchmark purists