by HarHarVeryFunny 1177 days ago
Any overly simple "it's just predicting next word" explanation of LLMs is really missing the point. It seems more accurate to regard next-word prediction as just the way these models are trained, rather than as a characterization of what they are learning and therefore what they are doing when they are generating.

There are two ways of looking at this.

1) In order to predict next word probabilities correctly, you need to learn something about the input, and the better you want to get, the more you need to learn. For example, if you just learned part-of-speech categories for words (noun vs verb vs adverb, etc), and what usually follows what, then you would be doing better than chance. If you want to do better than that, then you need to learn the grammar of the underlying language(s). If you want to do better than that, then you start to need to learn the meaning of what is being discussed, etc, etc.

If you want to correctly predict what comes next after "with a board position of ..., Magnus Carlsen might play", then you had better have learned a whole lot about the meaning of the input!

The "predict next word" training objective and the feedback it provides don't themselves limit what can be learned - that's up to the power of the model that is being trained, and evidently large multi-layer transformers are exceptionally capable. Calling these huge transformers "LLMs" (large language models) is deceptive since beyond a certain scale they are certainly learning a whole lot more than language/grammar.

2) In the words of one of the OpenAI developers (Sutskever), what these models have really learnt is some type of "world model" modelling the underlying generative processes that produced the training data. So, they are not just using surface level statistics to "predict next word", but rather are using the (often very lengthy/detailed) input prompt to "get into the head" of whatever generated it, and are predicting on that basis.
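
To make point 1 a bit more concrete, here is a toy sketch (my own illustration, nothing like how a transformer actually works). It scores three next-word predictors on a tiny made-up corpus: one that knows nothing, one that only knows word frequencies, and one that also knows what usually follows what. The corpus, the add-one smoothing, and all the function names are assumptions chosen for the example. The only point is that the same "predict next word" loss goes down as the model captures more structure.

```python
import math
from collections import Counter

# Tiny made-up corpus, purely for illustration.
TEXT = ("the cat sat on the mat . the dog sat on the rug . "
        "the cat saw the dog . the dog saw the cat .")
tokens = TEXT.split()
V = len(set(tokens))  # vocabulary size

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])  # times each word is followed by something


def p_uniform(prev, word):
    # Knows nothing: every word in the vocabulary is equally likely.
    return 1.0 / V


def p_unigram(prev, word):
    # Knows word frequencies only (add-one smoothed), ignores context.
    return (unigram_counts[word] + 1) / (len(tokens) + V)


def p_bigram(prev, word):
    # Knows what usually follows the previous word (add-one smoothed).
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + V)


def bits_per_token(prob):
    # Average next-word cross-entropy in bits over all (prev, word) pairs.
    pairs = list(zip(tokens, tokens[1:]))
    return -sum(math.log2(prob(prev, word)) for prev, word in pairs) / len(pairs)


losses = {
    "uniform": bits_per_token(p_uniform),
    "unigram": bits_per_token(p_unigram),
    "bigram": bits_per_token(p_bigram),
}

for name, loss in losses.items():
    print(f"{name:8s} {loss:.2f} bits/token")
```

Each step down in loss here comes from the predictor knowing more about the text. The argument above is that, with a big enough model and enough data, the next steps down require grammar and then meaning.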