by sailingparrot
133 days ago
> you don't need to make a video model. You probably don't need to decode the latents at all.

If you don't decode, how do you judge quality in a world where generative metrics are famously hard and imprecise?

How do you integrate RLHF/RLAIF into your pipeline if you don't decode? RLHF/RLAIF is not something you can skip anymore if you want SotA. Just look at the companies that are explicitly aiming for robotics/simulation: they *are* doing video models.