


	by hackerlight 818 days ago
	I don't get how transformers can replace convolutional networks. My understanding is patches get fed in, and the transformer will do the same thing that a convolution layer does. But transformers deal with sequential data and I don't see any of that here?

3 comments

Legend2440 818 days ago

Transformers are not limited to sequential data. They can process any form of data you can tokenize, as long as they have enough of it to learn the patterns and structures it contains.
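
To make "tokenize" concrete for images, here is a minimal sketch of cutting an image into non-overlapping patches and flattening each one into a token. The `patchify` helper and the toy 4x4 image are illustrative assumptions, not code from DiT or Sora; a real model would also project each token through a learned embedding.

```python
def patchify(image, patch):
    """Split an H x W grid (a list of rows) into flattened patch tokens.

    Hypothetical helper for illustration. Patches are emitted in
    row-major order, so the 2D image becomes a 1D sequence of tokens.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch x patch square into a single token.
            token = [image[top + dy][left + dx]
                     for dy in range(patch) for dx in range(patch)]
            tokens.append(token)
    return tokens


# Toy 4x4 "image" whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch=2)
```

The result is a list of four tokens with four values each, which a transformer can consume the same way it would consume a sequence of word embeddings.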


GaggiX 818 days ago

A transformer model was probably chosen because of its scaling properties, and because it's easy to mask the attention layer so that multiple videos and images of different lengths and dimensions fit in the same batch. This is important because every sample in a mini-batch needs to be the same size, and batching matters for performance.
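
One common way to do this kind of masking is to pad shorter sequences to a shared length and give the padding zero attention weight. The sketch below is a plain-Python, single-head version under that assumption; `masked_attention` and `pad` are hypothetical names, and nothing here is confirmed to be how Sora batches its inputs.

```python
import math


def masked_attention(queries, keys, values, valid):
    """Single-head scaled dot-product attention over one padded sequence.

    valid[j] is False for padding positions, which get a score of -inf
    and therefore zero weight after the softmax. Assumes at least one
    position is valid.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) if ok else -math.inf
            for k, ok in zip(keys, valid)
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))
        ])
    return outputs


def pad(tokens, length):
    """Pad a token list with zero vectors and return it with its validity mask."""
    dim = len(tokens[0])
    extra = length - len(tokens)
    padded = tokens + [[0.0] * dim for _ in range(extra)]
    valid = [True] * len(tokens) + [False] * extra
    return padded, valid


# A 2-token sequence padded to length 4, as if batched with a longer one.
short = [[1.0, 0.0], [0.0, 1.0]]
padded, valid = pad(short, 4)
unpadded_out = masked_attention(short, short, short, [True, True])
padded_out = masked_attention(padded, padded, padded, valid)
```

The first two rows of `padded_out` match `unpadded_out`, so the real tokens are unaffected by the padding. The rows computed for the padding positions themselves are meaningless and would be discarded.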


striking 818 days ago

From the fine article:

> Now for the second point, both DiT and Sora replace the commonly-used U-Net architecture with a vanilla Transformer architecture. This matters because the authors of the DiT paper observe that using Transformers leads to predictable scaling: As you apply more training compute (either by training the model for longer or making the model larger, or both), you obtain better performance. The Sora technical report notes the same but for videos and includes a useful illustration.

> This scaling behavior, which can be quantified by so-called scaling laws, is an important property and it has been studied before in the context of Large Language Models (LLMs) and for autoregressive models on other modalities. The ability to apply scale to obtain better models was one of the key drivers behind the rapid progress on LLMs. Since the same property exists for image and video generation, we should expect the same scaling recipe to work here, too.


cma 818 days ago

I think it just treats the patches as a flat sequence, the way they would be laid out in memory or on disk, but each patch also carries its coordinates. And they have overlapping patches at an offset to catch features that would span a patch boundary and be missed at that level.
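
A rough sketch of that idea, assuming overlap comes from using a stride smaller than the patch size: each patch is stored together with the (row, column) of its top-left corner. The `patches_with_coords` helper is hypothetical, and the thread does not confirm that Sora uses overlapping patches.

```python
def patches_with_coords(image, patch, stride):
    """Return ((top, left), flattened_pixels) pairs in row-major order.

    Hypothetical helper for illustration. With stride < patch,
    neighbouring patches overlap, so a feature that straddles one patch
    boundary falls inside a neighbouring patch.
    """
    height, width = len(image), len(image[0])
    out = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            pixels = [image[top + dy][left + dx]
                      for dy in range(patch) for dx in range(patch)]
            out.append(((top, left), pixels))
    return out


# Toy 4x4 "image" whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
overlapping = patches_with_coords(image, patch=2, stride=1)
disjoint = patches_with_coords(image, patch=2, stride=2)
```

With `stride=1` the 4x4 image yields nine overlapping patches instead of four. In a real model the coordinates would typically be turned into positional embeddings added to each token.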
