| I think people might be missing what this enables. It can make plausible continuations of video, with realistic physics. What happens if this gets fast enough to work _in real time_. Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting -- maybe more than one. You have an autonomous robot building a real time model of the world around it and predicting the future. Give it some error correction based on well each prediction models the actual outcome and I think you're _really_ close to AGI. You can probably already imagine different ways to wire the output to text generation and controlling its own motions, etc, and predicting outcomes based on actions it, itself could plausibly take, and choosing the best one. It doesn't actually have to generate realistic imagery or imagery that doesn't have any mistakes or imagery that's high definition to be used in that way. How realistic is our own imagination of the world? Edit: I'm going to add a specific case. Imagine a house cleaning robot. It starts with an image of your living room. Then it creates a image of your living room after it's been cleaned. Then it interpolates a video _imagining itself cleaning the room_, then acts as much as it can to mimic what's in the video, then generates a new continuation, then acts, and so on. Imagine doing that several times a second, if necessary. |
Check out V-Jepa for such a system: https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-jo...