Yes, but not based on rigorous comparison. I'm not saying ViT is bad, but it took over mainly because it's the shiny new thing. It's very bandwagon-y, even among PhD students.
> There's no 'rigorous comparison' that puts CNNs over ViTs
That’s not accurate. My team wrote a paper for school in which a ResNet model outperformed a ViT model of the same size on almost all metrics. These were smaller models, but depending on the use case, that might be what you want.
Don't know if it's you (did you publish?). I read about something similar, but it had its issues:
- Tuning hyperparameters to gain improvement on a dataset when you're constantly looking at the answers is pretty meaningless. It's basically testing on the training data.
- Eval on ImageNet-1k alone (very small, useless for the real world) made me wonder if it wasn't just overfit to the training set. Would it perform better when trained on the datasets used for the foundation models? I doubt it.
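To make the first point concrete, here is a minimal sketch of the usual way to avoid "looking at the answers": tune on a validation split and touch the test split only once. The function names, split fractions, and the toy scoring function are all hypothetical, not taken from any paper discussed here.

```python
import numpy as np

def three_way_split(n_samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Return disjoint train/val/test index arrays (fractions are assumptions)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx

def select_config(configs, val_score):
    """Pick the config with the best validation score.

    val_score is a hypothetical callable that trains on the train split and
    returns a score measured on the validation split only.
    """
    return max(configs, key=val_score)

train_idx, val_idx, test_idx = three_way_split(1000)

# Toy stand-in for "train, then score on validation": peaks at lr = 1e-3.
toy_val_score = lambda lr: -abs(np.log10(lr) + 3.0)
best_lr = select_config([1e-4, 1e-3, 1e-2], toy_val_score)
```

The test indices are never passed to `select_config`, so the final test number is not influenced by the tuning loop.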
Well, I'm not saying CNNs are bad or useless, at any rate.
Exactly. Most of the comparison papers are useless. This is hard stuff, and only a few people have the chops it takes to even attempt it. You can of course train some models and then post the numbers; that's not the hard part.
There's no 'rigorous comparison' that puts CNNs over ViTs in quality, and ViTs unlocked more use cases more easily than CNNs did. That's why they're more popular, not because it's 'bandwagon-y'.
What's the use case enabled vs. running a ConvNeXt or EfficientNetV2 and using the resulting strided features as you would the resulting tokens of a ViT? I'm not saying that ViT is worse. Just saying that the scholarship around comparing them is very bad or nonexistent. You have to properly tune the hyperparameters on both sides in a fair way, and also use all the general modern training tricks on the CNN side to make it fair.
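A minimal sketch of what "using the strided features as tokens" means in practice, assuming a feature map laid out as (batch, channels, height, width). The shapes below are made up for illustration and do not correspond to any specific ConvNeXt or EfficientNetV2 stage.

```python
import numpy as np

def feature_map_to_tokens(feature_map: np.ndarray) -> np.ndarray:
    """Flatten (batch, channels, height, width) into (batch, height * width, channels)."""
    b, c, h, w = feature_map.shape
    # Move channels last, then merge the two spatial axes into one token axis.
    return feature_map.transpose(0, 2, 3, 1).reshape(b, h * w, c)

# Hypothetical shapes: a 224x224 input at total stride 16 gives a 14x14 grid,
# the same token count as a ViT with 16x16 patches (ignoring any class token).
fmap = np.random.default_rng(0).standard_normal((2, 768, 14, 14))
tokens = feature_map_to_tokens(fmap)
print(tokens.shape)  # (2, 196, 768)
```

Each token is just the channel vector at one spatial location, so anything downstream that consumes a sequence of ViT tokens can, in principle, consume these instead.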