
Indeed, there are even multiple attempts to use both self-attention and convolutions in novel architectures, and there is evidence this works very well and may have significant advantages over pure vision transformer models [1-2].

IMO there is little reason to think transformers are (even today) the best architecture for any deep learning application. Perhaps if a mega-corp poured all its resources into some convolutional transformer architecture, you'd get something better than the current vision transformer (ViT) models. But since so much optimization and training work has already gone into ViTs, and since we clearly still haven't maxed out their capacity, it makes sense to stick with them at scale.

That being said, ViTs are currently still the clear best choice if you want something trained on a near-entire-internet of image or video data.

[1] https://arxiv.org/abs/2103.15808

[2] https://scholar.google.ca/scholar?hl=en&as_sdt=0%2C5&q=convo...