by jstanley
178 days ago
> Video generation models by definition are either predicting in noise or pixel space

I don't see that this follows "by definition" at all. Just because your output is pixel values doesn't mean your internal world model is in pixel space.

In either case, the impressiveness of that decoder can be far removed from the effectiveness of your world model, or involve no world model at all.