Lots of people do transfer learning with transformers. ViT[0] originally did CIFAR that way. Then DeiT[1] introduced some knowledge transfer through distillation (note: their student is larger than the teacher). ViT was pre-trained on both ImageNet-21k and JFT-300M.

CCT ([1] from above) was focused on training from scratch.

There are two paradigms to be aware of. Pre-training on ImageNet can often be beneficial, but it doesn't always help. It really depends on the problem you're trying to tackle and on whether the target dataset shares features with the pre-training dataset. If the similarity is low, you might as well train from scratch. Also, you might not want models as large as ViT and DeiT (the larger ViT variants have more parameters than CIFAR-10 has features).
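To put rough numbers on that last point, here is a back-of-the-envelope sketch. It assumes "features" means raw pixel values in the CIFAR-10 training set (50,000 images at 32x32x3), and it uses the approximate parameter counts reported in the ViT paper [0]. The names `dataset_values` and `VIT_PARAMS` are made up for illustration.

```python
# Back-of-the-envelope comparison of model size vs. dataset size.
# Assumption: "features" = raw pixel values in the CIFAR-10 training set.
CIFAR10_TRAIN_IMAGES = 50_000
VALUES_PER_IMAGE = 32 * 32 * 3  # height * width * RGB channels

# Approximate parameter counts as reported in the ViT paper [0].
VIT_PARAMS = {
    "ViT-Base": 86_000_000,
    "ViT-Large": 307_000_000,
    "ViT-Huge": 632_000_000,
}


def dataset_values(num_images=CIFAR10_TRAIN_IMAGES, per_image=VALUES_PER_IMAGE):
    """Total number of raw scalar values in the dataset."""
    return num_images * per_image


total = dataset_values()
print(f"CIFAR-10 train set: {total:,} raw values")
for name, params in VIT_PARAMS.items():
    # A ratio above 1 means the model has more parameters than the dataset has values.
    print(f"{name}: {params:,} params, ratio {params / total:.2f}")
```

Under these assumptions the training set holds about 153.6 million values, so ViT-Large and ViT-Huge exceed it while ViT-Base does not. This is only a sanity check on scale, not a rule for picking a model.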

Disclosure: Author on CCT

[0] https://arxiv.org/abs/2010.11929

[1] https://arxiv.org/abs/2012.12877