Does the training data budget scale with model size?
How would you compare the Gemma 4 draft model, which is also integrated with the base KV cache?