Given that 4.7 was a brand-new model, trained from scratch with a unique architecture and tokenization scheme, I don't see the same pattern. It seems arbitrary.
It means that for 4.7 they trained a new base model with a different architecture, different pre-training data (later knowledge cutoff), and a new tokenizer.
That's as opposed to fine-tuning an existing model, which was the case for 4.6 and probably for 4.8.