Hacker News
by FairlyInvolved 1388 days ago
I think that's just a scaling issue. Fundamentally, there's no reason why a model trained on video couldn't come to create coherent motion in the same way that image models can now produce coherent lighting/themes.

Smaller image models had the same problems with logical inconsistency, simply because they didn't have a sufficient general understanding of how visual concepts fit together.

The same is almost certainly true of video: early, smaller models will likely create janky movement. However, once they've seen enough video to understand how a person walks, how a scene is framed, etc., there's no reason we couldn't get to the same level of maturity as today's image models.

I think the real issue will come from labelling: most video is only going to be labelled with basic info/captions, without detailed descriptions of the camera pan or the movement of subjects. The amount of text required to accurately describe a scene is much larger than for a still image, and I'm not sure how one would go about collecting this.