by dkislyuk
803 days ago
Yes, exactly. ViTs need O(100M)-O(1B) images to overcome the lack of spatial priors. In that regime and beyond, they begin to generalize better than ConvNets. Unfortunately, ImageNet has not been a useful benchmark for a while now, since pre-training is so important for production visual foundation models.