Hacker News new | ask | show | jobs
by lucubratory 854 days ago
It's a diffusion transformer architecture, so no it doesn't piece together pieces of video. If you're familiar with denoising algorithms, the diffusion algorithm is essentially a semantically guided denoising algorithm. If you feed it pure noise so the only information it has is the semantic guiding, it will generate a video from that noise directly. I'm not sure exactly how the transformer part of the algorithm contributes, but my guess is that it's giving the denoiser the ability to not just look at adjacent pixels in 2D space, but across time through the attention mechanism. That's just a guess, though.
1 comments

Thank you, after reading your comment, I did some research and stumbled upon this, an explanation of how Sora works from Jim Fan:

https://www.reddit.com/r/LocalLLaMA/comments/1aspxox/explana...