Generating videos and 3D models is _much_ more difficult than generating images. You can’t just train on videos from the internet in the same way, because they don’t come with enough text labels to learn from the way CLIP learns from captioned images.
Oh, but they have sound, which can be annotated much faster and more efficiently. You also potentially have screenplays, but the amount of training data there is probably too small and too sparse.
FWIW, I don’t think AI systems will generate a whole video by themselves. It’ll be some form of image-to-image generation, where an artist renders a rough sketch of the scene and the AI fills in the details, frame by frame.
I wonder if subtitles could be used, so rather than describing the video, you just write a script and it generates video for you. I'm certainly no expert, but it does seem like there's a lot more data there.