by hackerlight 818 days ago
I don't get how transformers can replace convolutional networks. My understanding is patches get fed in, and the transformer will do the same thing that a convolution layer does. But transformers deal with sequential data and I don't see any of that here?
3 comments

Transformers are not limited to sequential data. They can process any form of data you can tokenize, as long as they have enough of it to learn the patterns and structures it contains.
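To make "tokenize" concrete for images, here is a minimal sketch in plain Python. The `patchify` helper and the toy 4x4 image are made up for illustration and are not taken from the article or from any particular model. It cuts a 2D grid into non-overlapping patches and flattens each one, so the image ends up as an ordinary sequence of tokens.

```python
def patchify(image, patch):
    """Split an H x W grid (a list of rows) into flattened patch tokens.

    Patches are read left to right, top to bottom, so the 2D image
    becomes a plain 1D sequence of tokens.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch x patch square into a single token.
            token = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            tokens.append(token)
    return tokens


# Toy 4x4 "image" whose pixel values are 0..15 (hypothetical data).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)
print(len(tokens))  # 4
print(tokens[0])    # [0, 1, 4, 5]
```

In a real model each flattened patch would then go through a learned projection and get a positional embedding; that part is left out here.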
A transformer model was probably chosen because of its scaling properties and because it's easy to mask the attention layer, so you can fit multiple videos and images of different lengths and dimensions into the same batch. This is important because every item in a mini-batch needs to have the same shape, both to fit in one batch at all and for performance reasons.
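A rough sketch of that masking trick, in plain Python with no ML library: pad the shorter token sequences to a common length, then give padded positions a score of minus infinity so they get zero attention weight. The names `pad_batch` and `masked_attention_weights` are hypothetical, and the tokens are used directly as queries and keys (no learned projections), so this only illustrates the idea and is not how any specific model implements it.

```python
import math


def pad_batch(sequences, pad_value=0.0):
    """Pad variable-length token sequences to one length.

    Returns the padded sequences plus a mask (True = real token).
    """
    max_len = max(len(seq) for seq in sequences)
    dim = len(sequences[0][0])
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [[pad_value] * dim for _ in range(n_pad)])
        mask.append([True] * len(seq) + [False] * n_pad)
    return padded, mask


def masked_attention_weights(tokens, mask):
    """Softmax attention weights for one sequence, ignoring padded keys.

    Assumption: tokens double as queries and keys, and the sequence
    contains at least one real (unmasked) token.
    """
    dim = len(tokens[0])
    weights = []
    for q in tokens:
        # Padded keys get -inf, which becomes weight 0 after softmax.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) if keep else -math.inf
            for k, keep in zip(tokens, mask)
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Two "clips" of different lengths (made-up 2D tokens) in one batch.
short = [[1.0, 0.0], [0.0, 1.0]]
long = [[1.0, 1.0], [0.5, 0.5], [0.0, 1.0], [1.0, 0.0]]
batch, mask = pad_batch([short, long])
weights = masked_attention_weights(batch[0], mask[0])
print(mask[0])     # [True, True, False, False]
print(weights[0])  # the last two weights are 0.0
```

The padded batch has one fixed shape, and the mask keeps the padding from influencing the real tokens.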
From the fine article:

> Now for the second point, both DiT and Sora replace the commonly-used U-Net architecture with a vanilla Transformer architecture. This matters because the authors of the DiT paper observe that using Transformers leads to predictable scaling: As you apply more training compute (either by training the model for longer or making the model larger, or both), you obtain better performance. The Sora technical report notes the same but for videos and includes a useful illustration.

> This scaling behavior, which can be quantified by so-called scaling laws, is an important property and it has been studied before in the context of Large Language Models (LLMs) and for autoregressive models on other modalities. The ability to apply scale to obtain better models was one of the key drivers behind the rapid progress on LLMs. Since the same property exists for image and video generation, we should expect the same scaling recipe to work here, too.

I think it just treats the patches as a sequence, the way they would be laid out sequentially in memory or on disk, but each patch also carries its coordinates. And I think they use overlapping patches at an offset to catch features that span a patch boundary and would otherwise be missed at that level.
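For what it's worth, here is what overlapping, coordinate-tagged patches could look like in plain Python. This only illustrates the idea in the comment above; I have not checked whether Sora or DiT actually extract patches this way, and `overlapping_patches`, the stride value, and the toy image are all made up.

```python
def overlapping_patches(image, patch, stride):
    """Return (row, col, pixels) triples for patches taken every `stride` pixels.

    With stride < patch, neighbouring patches overlap, so a feature that
    straddles one patch boundary falls inside another patch.
    """
    height, width = len(image), len(image[0])
    out = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            pixels = [image[r][c]
                      for r in range(top, top + patch)
                      for c in range(left, left + patch)]
            # Keep the top-left coordinate so position is not lost
            # once the patches are flattened into a sequence.
            out.append((top, left, pixels))
    return out


# Toy 4x4 "image" with pixel values 0..15 (hypothetical data).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = overlapping_patches(image, patch=2, stride=1)
print(len(patches))  # 9
print(patches[4])    # (1, 1, [5, 6, 9, 10])
```

With `stride` equal to `patch` this reduces to plain non-overlapping patches.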