Doesn't this "silent degradation" prevent any actual evaluation of the model? If the model fails at something, this allows anyone to claim that it failed due to degradation.
Who cares if it can be evaluated independently? The majority of commenters on HN were happy to vibe code and ship products with the models we had 1-2 years ago. It continues to be laughable.
I understand that moving the goalposts every release is unfair, but it's similarly concerning that people were letting GPT 4.X vibe code and ship entire products.