As far as I understand, this is exactly how Elo scores work. If a more capable model shows up and starts beating all the other models, it literally takes Elo points from everyone else.
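To make the "takes points from everyone else" part concrete, here is a minimal sketch of the classic online Elo update. It assumes the conventional 400-point logistic scale and K = 32; neither value comes from this thread, and the function names are just illustrative.

```python
def expected_score(r_a, r_b):
    """Probability that a player rated r_a beats a player rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_winner, r_loser, k=32.0):
    """Return updated (winner, loser) ratings after one decisive game.

    k=32 is an assumed step size, not a value taken from any leaderboard.
    """
    gain = k * (1.0 - expected_score(r_winner, r_loser))
    # Whatever the winner gains, the loser gives up: the update is zero-sum.
    return r_winner + gain, r_loser - gain


new_winner, new_loser = elo_update(1000.0, 1000.0)
```

With the same K for both players, the sum of the two ratings is unchanged by every game, so a newcomer that keeps winning can only climb by pulling points out of the existing pool.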
It's a fitted Bradley-Terry model, scaled to familiar Elo scores, anchored to wins against Mixtral-8x7B at 1114 (at least last time I looked at it). When you fit the model against historical data and then add another month of data that contains newer models, the relative strength of a given model might decline even if its absolute ability remained fixed.
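A rough sketch of what "fit Bradley-Terry, then rescale to Elo and anchor" can look like. This is not the leaderboard's actual code: it uses a simple iterative (minorization-maximization) fit, the 400 * log10 scaling is an assumption, and the model names and win counts are made up. Only the 1114 anchor value comes from the comment above.

```python
import math


def fit_bradley_terry(wins, n_iter=500):
    """Fit Bradley-Terry strengths with simple MM iterations.

    wins[(a, b)] is the number of times a beat b. Every model needs at
    least one win and one loss for the fit to be well defined.
    """
    models = sorted({m for pair in wins for m in pair})
    p = {m: 1.0 for m in models}
    for _ in range(n_iter):
        new_p = {}
        for i in models:
            total_wins = sum(c for (a, _b), c in wins.items() if a == i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (p[i] + p[j])
            new_p[i] = total_wins / denom
        # Strengths are only defined up to a common factor, so renormalize.
        scale = len(models) / sum(new_p.values())
        p = {m: v * scale for m, v in new_p.items()}
    return p


def to_elo(strengths, anchor, anchor_rating=1114.0, scale=400.0):
    """Map strengths to an Elo-like scale with `anchor` pinned to anchor_rating."""
    offset = anchor_rating - scale * math.log10(strengths[anchor])
    return {m: scale * math.log10(s) + offset for m, s in strengths.items()}


# Hypothetical win counts: first the historical data, then an extra month
# in which a stronger new model enters.
history = {("old_model", "anchor_model"): 60, ("anchor_model", "old_model"): 40}
later = dict(history)
later.update({
    ("new_model", "old_model"): 70, ("old_model", "new_model"): 30,
    ("new_model", "anchor_model"): 80, ("anchor_model", "new_model"): 20,
})

ratings_before = to_elo(fit_bradley_terry(history), "anchor_model")
ratings_after = to_elo(fit_bradley_terry(later), "anchor_model")
```

Because the fit is joint over all games, adding the new model's results can move `old_model` relative to the anchor a little, and it always drops a rank once something stronger is in the pool, even though nothing about `old_model` itself changed.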
It depends on what you use as an anchor. If the anchor is a fixed model, you’re right. If the anchor is updated to a better model over time, then the Elo of historical models degrades, right?
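A small sketch of the anchor question, under the assumption that anchoring is just a constant shift applied to already-fitted ratings. The raw numbers and model names below are invented for illustration; only the 1114 value is taken from the thread.

```python
def anchor_ratings(raw, anchor, anchor_rating=1114.0):
    """Shift every rating by the same constant so that `anchor` lands on anchor_rating."""
    shift = anchor_rating - raw[anchor]
    return {m: r + shift for m, r in raw.items()}


# Hypothetical unanchored ratings (only the differences are meaningful).
raw = {"anchor_model": 0.0, "old_model": 70.0, "new_model": 220.0}

fixed_anchor = anchor_ratings(raw, "anchor_model")  # anchor never changes
moved_anchor = anchor_ratings(raw, "new_model")     # anchor replaced by a better model
```

With the fixed anchor, `old_model` sits at 1184. If the anchor is moved to `new_model` at the same 1114, `old_model` drops to 964, while the gap between any two models is identical in both cases. So the displayed number degrades only when the reference point moves; the relative ordering and differences do not.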