| HN Mirror



	by underyx 38 days ago
	> the slow performance decays are just more capable models entering the population, making all prior models lose more frequently


TekMol 38 days ago

No, that is not how Elo scores work.


qnleigh 38 days ago

As far as I understand, this is exactly how Elo scores work. If a more capable model shows up and starts beating all the other models, it literally takes Elo points from everyone else.

https://en.wikipedia.org/wiki/Elo_rating_system
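To make the "takes points from everyone else" claim concrete, here is a minimal sketch of the textbook online Elo update. The K-factor of 32, the starting rating of 1200, and the model names are all assumptions for illustration, and the leaderboard under discussion may not use online updates at all.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_a, r_b, score_a, k=32.0):
    """Return both new ratings after one game; score_a is 1, 0.5, or 0."""
    delta = k * (score_a - expected_score(r_a, r_b))
    # Whatever A gains, B loses: the update is zero-sum.
    return r_a + delta, r_b - delta


# Hypothetical population: two older models and one stronger newcomer.
ratings = {"old_1": 1200.0, "old_2": 1200.0, "newcomer": 1200.0}
total_before = sum(ratings.values())

# The newcomer wins every game against both older models.
for _ in range(20):
    for old in ("old_1", "old_2"):
        ratings["newcomer"], ratings[old] = update(
            ratings["newcomer"], ratings[old], 1.0
        )

total_after = sum(ratings.values())
print(ratings)
print(total_before, total_after)
```

The sum of ratings stays constant, so every point the newcomer gains comes out of the older models' ratings.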


TekMol 38 days ago

    If a more capable model shows up and starts
    beating all the other models

There is an instance of this in the chart: on 2025-06-24, when Gemini-2.5-pro shows up. As you can see, the Elo of the others does not drop.


harperlee 38 days ago

Depends on the test design: is an agent competing against another agent in a given match, or against a test? Also, does the test's Elo fluctuate?


bitshiftfaced 38 days ago

It's a fitted Bradley-Terry model, scaled to familiar Elo scores and anchored to wins against Mixtral-8x7B at 1114 (at least last time I looked at it). When you fit the model against historical data and then add another month of data that contains newer models, the relative strength of a given model might decline even if its absolute ability remained fixed.
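A rough sketch of what fitting and anchoring could look like, assuming a plain minorization-maximization fit rather than whatever the leaderboard actually runs. The win counts and model names are made up; only the anchor value of 1114 comes from the comment above.

```python
import math


def fit_bradley_terry(wins, iters=500):
    """wins[i][j] = times model i beat model j. Returns strengths p (sum to 1).

    Uses the classic MM update; assumes every model has at least one win.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        scale = sum(new_p)
        p = [x / scale for x in new_p]
    return p


def to_elo_scale(p, anchor_index, anchor_rating=1114.0):
    """Map strengths to a 400*log10 scale, pinning one model to anchor_rating."""
    offset = anchor_rating - 400.0 * math.log10(p[anchor_index])
    return [400.0 * math.log10(x) + offset for x in p]


# Hypothetical head-to-head results, 100 games per pair.
names = ["anchor", "model_b", "model_c"]
wins = [
    [0, 40, 25],   # anchor's wins over (anchor, model_b, model_c)
    [60, 0, 35],   # model_b's wins
    [75, 65, 0],   # model_c's wins
]

p = fit_bradley_terry(wins)
ratings = dict(zip(names, to_elo_scale(p, anchor_index=0)))
print(ratings)
```

Because the whole table is refit from all games at once, adding a month of data with new models means re-estimating every strength, not just appending the newcomers.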


tasuki 38 days ago

Yes, that is in fact how Elo can work[0]. There are quite a few ways Elo systems can work.

[0]: https://en.wikipedia.org/wiki/Elo_rating_system


whiplash451 38 days ago

It depends on what you use as an anchor. If the anchor is a fixed model, you’re right. If the anchor is updated to a better model over time, then the Elo of historical models degrades, right?
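A toy illustration of the anchor question, assuming the relative gaps between models are already fitted and only the reference point changes. The raw gaps and model names are hypothetical; 1114 is just reused from the comment above.

```python
def anchor_ratings(raw, anchor_name, anchor_rating=1114.0):
    """Shift all ratings by a constant so the anchor lands on anchor_rating."""
    shift = anchor_rating - raw[anchor_name]
    return {name: r + shift for name, r in raw.items()}


# Hypothetical fitted gaps on an Elo-like scale (only differences matter).
raw = {"old_model": 0.0, "mid_model": 80.0, "new_model": 200.0}

fixed = anchor_ratings(raw, "old_model")  # anchor stays on the old model
moved = anchor_ratings(raw, "new_model")  # anchor moved to the better model

print(fixed["old_model"])  # 1114.0
print(moved["old_model"])  # 914.0
```

The gaps between models are identical in both cases; only the displayed numbers move, which is why the choice of anchor decides whether historical models appear to decay.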
