A lot of cool things are shot down by "it requires more compute, and by a lot, and we're already compute starved on any day of the week that ends in y, so, not worth it".
If we had a million times the compute? We might have brute forced our way to AGI by now.
It's kind of a shortcut answer by now. Especially for anything that touches pretraining.
"Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.
The answer is: check the paper, it says there on page 12 in a throwaway line that they used 3 times the compute for the new method than for the controls. And the gain was +4%.
A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.
This has a ton of seemingly random assumptions, why can't we compress multiple latent space representations into one? Even in simple tokenizers token "and" has no right being the same size as "scientist".
If we had a million times the compute? We might have brute forced our way to AGI by now.