by rocauc 808 days ago
Pulling out a key part of this post from a DeepMind 2023 paper[1]: “Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs outperform pre-trained ConvNets when evaluated fairly.”

Another common constraint in vision compared with language is that the long tails of the visual world are very long. There are a number of domains where you have very few examples to learn from (defects are designed to happen infrequently; rare species for identification show up, well, rarely). And pulling from the blog: "But small models ... benefit greatly from the exact type of experiment outlined in this post: strong augmentation with limited data trained across many epochs."
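
To make "strong augmentation with limited data trained across many epochs" concrete, here's a rough toy sketch in plain Python. It is not the blog's pipeline: `augment`, `epoch_views`, the 3x3 images, and the flip-plus-padded-crop recipe are all my own assumptions for illustration. The point is just that the same handful of images gets re-augmented every epoch, so the model rarely sees the exact same pixels twice.

```python
import random


def augment(image, rng, pad=1):
    """Random horizontal flip, then a random crop of the zero-padded image."""
    h, w = len(image), len(image[0])
    # Flip left-right with probability 0.5 (builds new rows, input is not mutated).
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]
    # Zero-pad by `pad` pixels on every side.
    padded_w = w + 2 * pad
    padded = [[0] * padded_w for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [[0] * padded_w for _ in range(pad)]
    # Crop back to the original h x w at a random offset.
    top = rng.randint(0, 2 * pad)
    left = rng.randint(0, 2 * pad)
    return [row[left:left + w] for row in padded[top:top + h]]


def epoch_views(dataset, epochs, seed=0):
    """Yield (epoch, augmented copy of the dataset) for each epoch."""
    rng = random.Random(seed)
    for epoch in range(epochs):
        yield epoch, [augment(img, rng) for img in dataset]


# Toy "limited data": two 3x3 images with made-up pixel values.
tiny_dataset = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[9, 8, 7], [6, 5, 4], [3, 2, 1]],
]

# Many epochs over the same two images, each epoch with fresh augmentations.
views = list(epoch_views(tiny_dataset, epochs=50))
```

In a real setup each epoch's augmented batch would go through a training step, and the augmentations would typically be much heavier than a flip and a one-pixel shift.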

[1] https://arxiv.org/pdf/2310.16764.pdf