| HN Mirror


	by dkislyuk 803 days ago
	Yes, exactly. ViTs need O(100M) to O(1B) images to overcome their lack of spatial priors. In that regime and beyond, they begin to generalize better than ConvNets. Unfortunately, ImageNet hasn't been a useful benchmark for a while now, since pre-training is so important for production visual foundation models.