by ACCount37
279 days ago
"Anecdata" is notoriously unreliable when it comes to estimating AI performance over time. Sure, people complain about Anthropic's AI models getting worse over time, and about OpenAI's models too. But guess what? If you serve them open-weights models, they also complain about the models getting worse over time. Same exact checkpoint, same exact settings, same exact hardware. Relative LMArena metrics, however, are fairly consistent over time. The takeaway is that users are not reliable LLM evaluators. My hypothesis is that users have a "learning curve" and get better at spotting LLM mistakes over time, both overall and for a specific model checkpoint, which results in increasingly critical evaluations.
Living evals can solve the quantitative issues with infra and model updates, but I'm not sure how to deal with perceptual adaptation.