by LarsDu88
176 days ago
I feel like there's a bit of a disconnect between the cool video demos shown here and, say, the type of world models someone like Yann LeCun is talking about. A proper world model like JEPA should be predicting in latent space, where the representation of what is going on is highly abstract. Video generation models by definition are predicting in either noise or pixel space (latent noise if the diffuser is diffusing in a variational autoencoder's latent space).

It seems like what this lab is doing is quite vanilla, and I'm wondering if they are doing any sort of research in less demo-sexy joint embedding predictive spaces. There was a recent paper, LeJEPA, from LeCun and a postdoc that actually fixes many of the mode/distribution collapse issues with the JEPA embedding models I just mentioned.

I'm waiting on the startup or research group that gives us an unsexy world model: one that, instead of giving us 1080p video of supermodels camping, gives us a slideshow of something a 6-year-old child would draw. That would be a more convincing demonstration of an effective world model.
|
I don't see that this follows "by definition" at all.
Just because your output is pixel values doesn't mean your internal world model is in pixel space.
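
To make the distinction in this exchange concrete, here is a toy sketch of the two training objectives being contrasted. Everything in it is an assumption for illustration: the names (`encoder`, `predictor`, `decoder`), the sizes, and the random untrained linear maps are hypothetical stand-ins, not how any real lab or the JEPA papers implement things. The point is only that the same internal latent prediction can be scored either in pixel space (after decoding) or directly in latent space.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def random_matrix(rows, cols, rng):
    """Untrained random weights; a real model would learn these."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
PIXELS, LATENT = 16, 4  # hypothetical sizes: a 16-pixel "frame", 4-d latent

encoder = random_matrix(LATENT, PIXELS, rng)    # pixels -> latent
predictor = random_matrix(LATENT, LATENT, rng)  # latent at t -> latent at t+1
decoder = random_matrix(PIXELS, LATENT, rng)    # latent -> pixels

frame_t = [rng.random() for _ in range(PIXELS)]
frame_next = [rng.random() for _ in range(PIXELS)]

# Both objectives share the same internal step: predict the next latent.
z_t = matvec(encoder, frame_t)
z_pred = matvec(predictor, z_t)

# Video-generation-style objective: decode, then compare in pixel space.
pixel_loss = mse(matvec(decoder, z_pred), frame_next)

# JEPA-style objective: compare against the encoded target in latent space.
latent_loss = mse(z_pred, matvec(encoder, frame_next))

print(f"pixel-space loss:  {pixel_loss:.4f}")
print(f"latent-space loss: {latent_loss:.4f}")
```

In this sketch the dynamics (`predictor`) live in latent space under both objectives, which is the reply's point: pixel outputs do not force the internal model to be in pixel space. What differs is where the error is measured. Real systems use deep networks and need extra machinery to keep a latent-only objective from collapsing to a trivial constant embedding.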