| Logged into my personal account for this one! I'm a lead author on a paper that explored exactly this. It does enable faster training and smaller model sizes. For reference, you can get 80% accuracy on CIFAR-10 in ~30 minutes on a CPU (without any crazy optimizations). There are open questions about scaling, but at the time we did not have access to big compute (we really still don't), and our goals were focused on addressing the original ViT paper's claims about data constraints and the necessity of pretraining for smaller datasets (spoiler: augmentation + overlapping patches play a huge role). Basically, we wanted to make a network that lets people train transformers from scratch for their own data projects, because pretrained models aren't always the best or most practical solution. Paper: https://arxiv.org/abs/2104.05704 Blog: https://medium.com/pytorch/training-compact-transformers-fro... CPU compute: https://twitter.com/WaltonStevenj/status/1382045610283397120 Crazy optimizations (no affiliation): 94% on CIFAR-10 in <6.3 seconds on a single A100: https://github.com/tysam-code/hlb-CIFAR10 I also want to point to some better information about ViTs in general. Lucas Beyer is a good source and has some lectures, and Hila Chefer and Sayak Paul have tutorials. Also, just follow Ross Wightman, the man is a beast. Lucas Beyer: https://twitter.com/giffmana/status/1570152923233144832 Chefer & Paul's All Things ViT: https://all-things-vits.github.io/atv/ Ross Wightman: https://twitter.com/wightmanr His very famous timm package: https://github.com/huggingface/pytorch-image-models |
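
To make "overlapping patches" concrete, here is a toy sketch in plain Python. It is not the paper's tokenizer and makes no claim about how that is implemented. The function name `extract_patches`, the 8x8 image, and the kernel and stride values are all assumptions chosen for illustration. The only point is that a stride smaller than the patch size makes neighbouring patches share pixels and yields more tokens from the same image.

```python
def extract_patches(image, kernel, stride):
    """Slide a kernel x kernel window over a 2D list-of-lists image.

    stride == kernel gives non-overlapping patches (the original ViT style);
    stride < kernel makes neighbouring patches share pixels.
    Hypothetical helper for illustration only.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - kernel + 1, stride):
        for left in range(0, width - kernel + 1, stride):
            patch = [row[left:left + kernel] for row in image[top:top + kernel]]
            patches.append(patch)
    return patches


# Toy 8x8 "image" whose pixel values are just their flat index (assumed data).
image = [[r * 8 + c for c in range(8)] for r in range(8)]

non_overlapping = extract_patches(image, kernel=4, stride=4)
overlapping = extract_patches(image, kernel=4, stride=2)

print(len(non_overlapping))  # 4 patches (2 x 2 grid)
print(len(overlapping))      # 9 patches (3 x 3 grid)
```

With a stride of 2 and a 4x4 window, each patch shares half of its columns (or rows) with its neighbour, so local structure that straddles a patch boundary in the non-overlapping case is seen whole by at least one token.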