by GaggiX
818 days ago
A transformer was probably chosen for its scaling properties and because it's easy to mask the attention layer, which lets you fit multiple videos and images of different lengths and dimensions in the same batch. This matters because every sample in a mini-batch has to be padded to the same shape, both so the batch fits in a single tensor and for performance reasons.
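To make the masking point concrete, here is a minimal NumPy sketch, not anything taken from the Sora report. It assumes each video or image has already been turned into a sequence of token vectors; the names `pad_batch` and `masked_attention` are made up for illustration. Shorter sequences are zero-padded to the longest one, and a boolean mask keeps the padding from receiving any attention weight.

```python
import numpy as np

def masked_attention(q, k, v, key_mask):
    """Scaled dot-product attention that ignores padded key positions.

    q, k, v: arrays of shape (batch, tokens, dim)
    key_mask: boolean array of shape (batch, tokens), True for real tokens
    """
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    # Padded keys get -inf, so softmax assigns them exactly zero weight.
    scores = np.where(key_mask[:, None, :], scores, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def pad_batch(seqs):
    """Zero-pad variable-length (tokens, dim) arrays to a common length."""
    max_len = max(s.shape[0] for s in seqs)
    dim = seqs[0].shape[1]
    batch = np.zeros((len(seqs), max_len, dim))
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for i, s in enumerate(seqs):
        batch[i, : s.shape[0]] = s
        mask[i, : s.shape[0]] = True
    return batch, mask

# Hypothetical inputs: a 3-token clip and a 5-token clip, 8 features each.
rng = np.random.default_rng(0)
short_clip = rng.normal(size=(3, 8))
long_clip = rng.normal(size=(5, 8))

x, mask = pad_batch([short_clip, long_clip])
out = masked_attention(x, x, x, mask)

# Reference: the short clip processed on its own, with no padding at all.
alone = masked_attention(
    short_clip[None], short_clip[None], short_clip[None],
    np.ones((1, 3), dtype=bool),
)
```

The first three rows of `out[0]` match `alone[0]` up to floating-point error, so padding the short clip to sit next to the long one does not change its result. The rows of `out` that correspond to padded positions are meaningless and would be dropped or excluded from the loss.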
> Now for the second point, both DiT and Sora replace the commonly-used U-Net architecture with a vanilla Transformer architecture. This matters because the authors of the DiT paper observe that using Transformers leads to predictable scaling: As you apply more training compute (either by training the model for longer or making the model larger, or both), you obtain better performance. The Sora technical report notes the same but for videos and includes a useful illustration.
> This scaling behavior, which can be quantified by so-called scaling laws, is an important property and it has been studied before in the context of Large Language Models (LLMs) and for autoregressive models on other modalities. The ability to apply scale to obtain better models was one of the key drivers behind the rapid progress on LLMs. Since the same property exists for image and video generation, we should expect the same scaling recipe to work here, too.
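As a rough illustration of what "quantified by scaling laws" means in the quote, the sketch below fits a power law of the form loss = a * compute^(-b) with a straight-line fit in log-log space. The data points are synthetic and generated inside the script; they are not results from the DiT paper or the Sora report, and the constants 50 and 0.05 are arbitrary.

```python
import numpy as np

# Synthetic training runs: compute budgets (FLOPs) and made-up losses that
# follow an exact power law, loss = 50 * compute ** -0.05.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = 50.0 * compute ** -0.05

# A power law is a straight line in log-log space:
#   log(loss) = log(a) - b * log(compute)
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a_fit = np.exp(intercept)
b_fit = -slope

# Extrapolate the fitted law to a 10x larger budget than any run above.
predicted_loss = a_fit * 1e22 ** -b_fit
```

"Predictable scaling" in this sense means a fit made on smaller runs keeps holding when extrapolated to larger ones. Real measurements are noisy and usually need a more careful functional form than this.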