| HN Mirror

by jncfhnb 783 days ago

There are some enormous caveats to this.

The model architecture is 100% the thing that makes LLMs special. You would not get this doing token prediction with word2vec.

The model sizes are also hugely important. Adding billions of parameters does introduce the capability to fit to new features.

Models eventually saturate: there is a limit to how much they can fit. There’s reason to believe that current LLMs are underfit relative to what their sizes could theoretically support, but it could also be that the optimization algorithms simply can’t make efficient use of another 2x of data to fill out that capacity. Doubling the model size on the same training data, and letting it be even more underfit, could still result in a better model.
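To make the size-versus-data tradeoff concrete, here is a rough sketch using a parametric loss of the form L(N, D) = E + A/N^alpha + B/D^beta, a shape commonly used in scaling-law work. This form and every constant below are assumptions chosen for illustration, not measurements of any particular model, and the function name `predicted_loss` is made up for this example.

```python
# Illustrative only: a parametric loss curve of the form
#   L(N, D) = E + A / N**alpha + B / D**beta
# where N is parameter count and D is training tokens.
# All constants are assumed placeholder values, not fitted results.

def predicted_loss(n_params, n_tokens,
                   e=1.69, a=406.4, b=410.7, alpha=0.34, beta=0.28):
    """Return the modeled loss for a given model size and dataset size."""
    model_term = a / n_params ** alpha   # shrinks as the model grows
    data_term = b / n_tokens ** beta     # shrinks as the data grows
    return e + model_term + data_term

# Hypothetical baseline: 7B parameters trained on 1T tokens.
base = predicted_loss(7e9, 1e12)

# Option 1: double the model size on the same data (more underfit).
bigger_model = predicted_loss(14e9, 1e12)

# Option 2: keep the model size and double the data.
more_data = predicted_loss(7e9, 2e12)

print(f"baseline:      {base:.4f}")
print(f"2x parameters: {bigger_model:.4f}")
print(f"2x data:       {more_data:.4f}")
```

Under this idealized curve both options lower the loss, and which one helps more depends entirely on the constants. The curve also assumes the optimizer can actually exploit the extra data, which is exactly the part that might not hold in practice.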