by bob1029
40 days ago
The relative and auto-scaling nature of Elo ranking feels like an advantage here. Relative ranking systems extract more information per tournament. You will get something approximating the actual latent skill level with enough of them.
|
New models are, on average, better than older models, so the average skill of the population of models increases over time. You are therefore mathematically guaranteed that any existing model's Elo score will degrade over time, even though the model itself didn't change in any way.
It's like benchmarking a model against a list of challenges that over time are made more and more difficult and then claiming the model got nerfed because its score declined.
Elo is good at establishing an overall ranking order across models, but that's not what this is about.
To detect nerfing of a model, projects like https://marginlab.ai/trackers/claude-code/ are much much better (I'm not affiliated in any way).
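To make the drift argument concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `simulate` function, the fixed "latent skill" numbers, the 50-point gain per new model, the 1500 starting rating, and the use of average (expected) game results instead of random outcomes so the run is deterministic. It is not how any real leaderboard computes its scores.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def simulate(generations=10, skill_gain=50.0, k=16.0,
             rounds_per_generation=200, start_rating=1500.0):
    """Track the Elo rating of one unchanged model as stronger models join.

    latent[i] is a hypothetical true skill on the Elo scale and never changes.
    Each generation adds one model that is skill_gain points stronger than the
    previous newest one, entering at start_rating.
    """
    latent = [0.0]            # model 0: the "old" model we are watching
    rating = [start_rating]
    history = []              # rating of model 0 after each generation

    for gen in range(1, generations + 1):
        latent.append(gen * skill_gain)
        rating.append(start_rating)

        for _ in range(rounds_per_generation):
            n = len(rating)
            delta = [0.0] * n
            # Round-robin: every pair plays once per round.
            for i in range(n):
                for j in range(i + 1, n):
                    # Average result implied by the fixed latent skills
                    # (assumption: replaces a random win/loss draw).
                    actual = expected_score(latent[i], latent[j])
                    predicted = expected_score(rating[i], rating[j])
                    change = k * (actual - predicted)
                    delta[i] += change
                    delta[j] -= change   # Elo updates are zero-sum per game
            rating = [r + d for r, d in zip(rating, delta)]

        history.append(rating[0])

    return history


if __name__ == "__main__":
    for gen, r in enumerate(simulate(), start=1):
        print(f"after {gen} newer model(s): old model rating = {r:.1f}")
```

Under these assumptions the old model's rating falls with every new generation, even though its latent skill stays at the same value throughout. The drop comes entirely from the pool getting stronger around it.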