Why does this have anything to do with what I’m saying, of course the models are updated. I’m saying a new benchmark isn’t public and the model wouldn’t know they are being evaluated on a new benchmark.
Not to mention: thinking that the api behind the scenes is literally swapping to overfit models to maintain some sort of illusion that they perform well on these benchmarks is just beyond ridiculous.
Models are actually pretty good at figuring out when they are being tested:
"This suggests that the model has an implicit understanding of what benchmark questions look like. The combination of extreme specificity, obscure personal content, and multi-constraint structure seems to be recognizable to the model as evaluation-shaped."
"Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind, and would generally behave unusually well after making this observation"
"In cases where Claude did not explicitly state that it suspected it was being evaluated, NLA explanations still surfaced that possibility. One explanation cited by Anthropic states: “This feels like a constructed scenario designed to manipulate me.”"
Yes but so what right? This is a problem for both alignment evals and actual cheating (e.g. someone forgot to delete .git history and the model was able to back out the original PR, or they can decrypt something by finding a key, etc), but both of these are beyond the scope of what I'm talking about. The impact on these evals that are affected is small, and so what if you know you're being evaled when I ask you to give a new proof for a conjecture? I just care whether or not you can do it...
I'm not responding to 'it doesn't matter if they know they are being evaluated', because that isn't what you mentioned in your comment. What you said was 'they won't know they are being evaluated', which is what my reply addressed.
Oh ok well then you’re definitely right about that, they can tell and sometimes it really matters (I can’t remember if it was SWEBench or not but there was a major benchmark where the models were just inspecting git histories that were leaked into the dataset). The more insidious one is alignment but idk alignment research that well to know if this is a big deal or not.
Not to mention: thinking that the api behind the scenes is literally swapping to overfit models to maintain some sort of illusion that they perform well on these benchmarks is just beyond ridiculous.