by dkislyuk 803 days ago
Yes, exactly. ViTs need O(100M)-O(1B) images to overcome the lack of spatial priors. In that regime and beyond, they begin to generalize better than ConvNets.

Unfortunately, ImageNet has not been a useful benchmark for a while now, since pre-training is so important for production visual foundation models.