by FairlyInvolved
1388 days ago
I think that's just a scaling issue. Fundamentally, there's no reason why a model trained on video couldn't come to create coherent motion in the same way that image models can now produce coherent lighting/themes. Smaller image models had the same problems with logical inconsistency just because they didn't have a sufficient general understanding of visual concepts. The same is almost certainly true of video - early, smaller models will likely create janky movement, but once they've seen enough video to understand how a person walks, how a scene is framed, etc., there's no reason we couldn't get to the same level of maturity as today's image models. I think the real issue will come from labelling - most video is only going to be labelled with basic info/captions, without detailed descriptions of the camera pan or the movement of subjects. The amount of text required to accurately describe a video scene is much larger than for a still image, and I'm not sure how one would go about collecting this.