Hacker News
by kungfupawnda 854 days ago
I was wondering about the internals of Sora
1 comments

It's a diffusion transformer architecture, so no, it doesn't stitch together pieces of existing video. If you're familiar with denoising algorithms, diffusion is essentially a semantically guided denoising algorithm. If you feed it pure noise, so that the only information it has is the semantic guidance, it will generate a video directly from that noise. I'm not sure exactly how the transformer part contributes, but my guess is that it gives the denoiser the ability to look not just at adjacent pixels in 2D space but also across time, through the attention mechanism. That's just a guess, though.
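To make the "attention across time" guess a bit more concrete, here is a toy sketch in plain Python. It is not Sora's implementation, which isn't public in this detail. It assumes the video is cut into patches, that every patch from every frame is flattened into one token sequence, and that a single attention head with no learned projections is run over that sequence. The names `spacetime_attention`, `video`, and the tiny sizes are all made up for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spacetime_attention(video):
    """Toy single-head self-attention over all patches of all frames.

    video[t][y][x] is a feature vector (list of floats) for one patch.
    Queries, keys and values are the raw features (identity projections),
    which is an assumption made to keep the sketch short.
    Returns (outputs, weights, positions).
    """
    tokens, positions = [], []
    for t, frame in enumerate(video):
        for y, row in enumerate(frame):
            for x, vec in enumerate(row):
                tokens.append(vec)
                positions.append((t, y, x))

    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in tokens:
        # Score this patch against every patch, in this frame and all others.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Output is a weighted mix of all patch features.
        outputs.append(
            [sum(w[j] * tokens[j][d] for j in range(len(tokens))) for d in range(dim)]
        )
    return outputs, weights, positions


random.seed(0)
T, H, W, D = 3, 2, 2, 4  # frames, patch rows, patch columns, feature size
video = [
    [[[random.gauss(0, 1) for _ in range(D)] for _ in range(W)] for _ in range(H)]
    for _ in range(T)
]
outputs, weights, positions = spacetime_attention(video)

# How much of the first patch's attention lands on patches in other frames.
cross_frame = sum(w for w, (t, _, _) in zip(weights[0], positions) if t != 0)
print(f"attention mass on other frames: {cross_frame:.3f}")
```

The point is only that once patches from all frames sit in one sequence, each patch's update can draw on any other patch, whether it is next to it in the same frame or in a different frame entirely. A real model would add learned projections, positional information, and the text conditioning.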
Thank you. After reading your comment, I did some research and stumbled upon this explanation of how Sora works from Jim Fan:

https://www.reddit.com/r/LocalLLaMA/comments/1aspxox/explana...