| HN Mirror

It's a diffusion transformer architecture, so no it doesn't piece together pieces of video. If you're familiar with denoising algorithms, the diffusion algorithm is essentially a semantically guided denoising algorithm. If you feed it pure noise so the only information it has is the semantic guiding, it will generate a video from that noise directly. I'm not sure exactly how the transformer part of the algorithm contributes, but my guess is that it's giving the denoiser the ability to not just look at adjacent pixels in 2D space, but across time through the attention mechanism. That's just a guess, though.