by jonas21
546 days ago
If you want to train a model to have a general understanding of the physical world, one way is to show it videos, ask it to predict what comes next, and then score it on how close its prediction was to what actually came next. To really do well on this task, the model basically has to understand physics, and human anatomy, and all sorts of cultural things. So you're forcing the model to learn all these things about the world, but it's relatively easy to train because you can just collect a lot of videos and show the model parts of them: you know what the next frame is, but the model doesn't. Along the way, this also creates a video generation model, but you can think of that as more of a nice side effect than the ultimate goal.
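To make the "you know what the next frame is, but the model doesn't" part concrete, here is a toy sketch. Everything in it is an assumption for illustration: a "frame" is just the [x, y] position of a ball, and the two "models" are hand-written rules rather than trained networks. The helper names (`make_training_pairs`, `average_loss`, and so on) are made up for this example.

```python
def make_training_pairs(frames, context_len):
    """Slice one video into (context, next_frame) pairs.

    The label comes for free: it is simply the frame that follows.
    """
    pairs = []
    for t in range(context_len, len(frames)):
        pairs.append((frames[t - context_len:t], frames[t]))
    return pairs


def mse(predicted, actual):
    """Mean squared error between a predicted frame and the real one."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def repeat_last_frame(context):
    """A predictor with no notion of motion: nothing ever changes."""
    return context[-1]


def constant_velocity(context):
    """A predictor that has 'learned' inertia: keep moving the same way."""
    last, prev = context[-1], context[-2]
    return [2 * l - p for l, p in zip(last, prev)]


def average_loss(predictor, pairs):
    """Average prediction error of a predictor over all training pairs."""
    return sum(mse(predictor(ctx), nxt) for ctx, nxt in pairs) / len(pairs)


# Hypothetical toy video: a ball drifting at constant velocity.
video = [[float(t), 0.5 * t] for t in range(10)]
pairs = make_training_pairs(video, context_len=2)

loss_static = average_loss(repeat_last_frame, pairs)
loss_physics = average_loss(constant_velocity, pairs)
print(loss_static, loss_physics)
```

The predictor that encodes a bit of physics gets a lower loss than the one that doesn't, which is the pressure the training objective puts on a real model. A real setup would replace the hand-written rules with a learned network and the [x, y] pairs with pixels.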
|
All these models have just “seen” enough videos of all those things to build a probability distribution to predict the next step.
That isn't bad, and it doesn't make them inherently dumb; a major component of human intelligence is built on similar strategies. I couldn't tell you which grammatical rules a piece of text breaks, or which physical rules a photograph breaks, but I can tell that something is wrong using the same methods.
Inference can take you far with large enough data sets, but sooner or later, without reasoning, you will hit a ceiling.
This is true for humans as well: plenty of people go far in life on memorization and replication alone and do a lot of jobs fairly competently, but not every job.
Reasoning is essential for higher-order functions, and transformers are not the path to that.