by Eisenstein
311 days ago
Why does 'next-word prediction' explain why huge models work? You're saying we needed scale, and that we use next-word prediction, but how does one relate to the other? Diffusion models also exist and work well for images, and they do seem to work for LLMs too.

The leap is in transforming the ill-defined objective of "modeling intelligence" into a concrete proxy objective. Note that the task isn't even "model the distribution of valid/true things", since validity and truth are hard to define. It's something akin to "the distribution of things a human might say", implemented in the "dumbest" possible way: "modeling the distribution of humanity's collective textual output".
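
To make "concrete proxy objective" tangible, here is a toy sketch of what that objective looks like. It assumes a bigram count table standing in for the neural network, and the corpus and function names (`train_bigram`, `avg_nll`) are made up for illustration. The point is only that "predict the next word" reduces to a number you can measure and minimize, with no definition of truth or intelligence required.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Estimate P(next | prev) by counting adjacent word pairs."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
        for prev, ctr in counts.items()
    }

def avg_nll(model, tokens):
    """Average negative log-likelihood of each next word (the training loss)."""
    pairs = list(zip(tokens, tokens[1:]))
    return -sum(math.log(model[p][n]) for p, n in pairs) / len(pairs)

# Hypothetical stand-in for "humanity's collective textual output".
corpus = "the cat sat on the mat the cat ate".split()
model = train_bigram(corpus)

print(model["the"])            # distribution over words that follow "the"
print(avg_nll(model, corpus))  # lower means the model fits the text better
```

A real LLM swaps the count table for a large network conditioned on the whole preceding context, but the quantity being driven down is the same kind of loss.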