Does it just piece bits of videos together from the randomness of the internet to generate something? So it's a statistical match to some aggregate combination of videos?
It's a diffusion transformer architecture, so no, it doesn't stitch together pieces of video. If you're familiar with denoising algorithms, diffusion is essentially a semantically guided denoising algorithm: the model is trained to remove noise, and the semantic guidance steers what the cleaned-up result should look like. If you feed it pure noise, so that the only information it has is the semantic guidance, it will generate a video from that noise directly. I'm not sure exactly how the transformer part contributes, but my guess is that the attention mechanism lets the denoiser look not just at adjacent pixels in 2D space, but across time as well. That's just a guess, though.
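To make the "start from pure noise and denoise under guidance" idea concrete, here is a toy sketch in Python. It is not how a real video model works: there is no neural network, the hypothetical `guide` list stands in for whatever the semantic guidance describes, and the "predicted noise" is just the gap between the current signal and that guide. It only shows the shape of the loop.

```python
import random

def toy_denoise(noise, guide, steps=50, strength=0.1):
    """Repeatedly nudge a noisy signal toward what the guide describes."""
    x = list(noise)
    for _ in range(steps):
        # A real diffusion model predicts the noise with a trained network.
        # Assumption for this toy: the "prediction" is simply x minus the guide.
        predicted_noise = [xi - gi for xi, gi in zip(x, guide)]
        # Remove a small fraction of the predicted noise at each step.
        x = [xi - strength * ni for xi, ni in zip(x, predicted_noise)]
    return x

random.seed(0)
guide = [0.0, 0.5, 1.0, 0.5, 0.0]            # hypothetical stand-in for the guidance
noise = [random.gauss(0, 1) for _ in guide]  # the starting point is pure noise
result = toy_denoise(noise, guide)
print([round(v, 2) for v in result])
```

Nothing is copied in from a stored clip. The output is whatever the repeated denoising steps converge to, given the guidance.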
Are you asking how AI art works in general? That would take more than fits in a comment, and there are lots of explainers online. You could even ask ChatGPT this question. But no, it's not piecing bits of videos together. It doesn't store enough about any individual piece of art to hold a piece of it. The training art might add up to X bytes, where X is a large chunk of the whole internet, but the model itself is only a fraction of a percent of that size. So it can't possibly store chunks of each piece of art. Even 1 pixel from each piece would be too much. What it does store is the patterns the pieces have in common, and it generates from those.
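The size argument is easy to check with back-of-the-envelope arithmetic. The numbers below are round, hypothetical figures chosen only for illustration and do not describe any particular model or dataset. The helper `bytes_per_item` is likewise made up for this sketch.

```python
def bytes_per_item(num_parameters, bytes_per_parameter, num_training_items):
    """Average storage budget the model has per training item, in bytes."""
    return num_parameters * bytes_per_parameter / num_training_items

# Hypothetical figures: a 1-billion-parameter model at 2 bytes per parameter,
# trained on 2 billion images.
budget = bytes_per_item(1_000_000_000, 2, 2_000_000_000)

RGB_PIXEL_BYTES = 3  # one uncompressed 8-bit RGB pixel

print(f"budget per training image: {budget} bytes")
print(f"enough for one RGB pixel each? {budget >= RGB_PIXEL_BYTES}")
```

Under these assumed numbers the model has about one byte per training image, which is less than a single uncompressed RGB pixel.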