Hacker News
by bonoboTP 6 days ago
CNNs are fine when trained with a good recipe. There are very few good studies comparing them with proper hyperparam search and all the training tricks applied consistently. Transformers are good but ViT vs CNN is not some settled issue. Transformers are more hyped and more popular with the tech enthusiasts who just read forums and news, but if you need stuff done, CNNs are still great.

I agree, but since we're talking about image understanding with text output, clearly a CNN is unsuitable. My previous comment was overly reductive, and CNNs can still be SoTA depending on your performance metrics. I spent the earlier part of my career training CNNs, and they are very pleasant to work with.
You can run a CNN and use the downsampled feature map the same way as patch tokens.
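A minimal sketch of that idea in plain Python. The shapes are assumptions chosen for illustration (a 224x224 input, a ViT patch size of 16, a CNN stage with total stride 16), and `grid_size` and `feature_map_to_tokens` are hypothetical helpers, not any library's API. The point is only that a channels-first `C x H x W` feature map flattens into `H*W` tokens of dimension `C`, the same layout as ViT patch tokens.

```python
def grid_size(image_size: int, stride: int) -> int:
    """Side length of the token grid for a square image and a given stride or patch size."""
    return image_size // stride


def feature_map_to_tokens(feature_map):
    """Flatten a channels-first map [C][H][W] into a row-major list of H*W tokens of length C."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


# ViT with 16x16 patches on a 224x224 image: 14 x 14 patch tokens.
vit_tokens = grid_size(224, 16) ** 2

# CNN feature map taken at total stride 16 (assumed): also a 14 x 14 grid.
cnn_tokens = grid_size(224, 16) ** 2

# Toy 2-channel, 2x3 feature map standing in for real CNN activations.
toy_map = [
    [[1, 2, 3], [4, 5, 6]],        # channel 0
    [[10, 20, 30], [40, 50, 60]],  # channel 1
]
tokens = feature_map_to_tokens(toy_map)  # 6 tokens, each of dimension 2
```

In a tensor framework the same step is usually a flatten over the spatial axes followed by a transpose, for example `x.flatten(2).transpose(1, 2)` on a PyTorch `(B, C, H, W)` tensor.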
>Transformers are more hyped and more popular with the tech enthusiasts who just read forums and news, but if you need stuff done, CNNs are still great.

ViTs are straight up more popular for ML research now; it's not just 'tech enthusiasts'.

There's a dearth of research properly comparing them.
I'm talking about research pushing the state of the art in computer vision. ViTs have 100% become more popular than CNNs in most CV research.
Yes, but not based on rigorous comparison. I'm not saying ViT is bad. But it took over mainly because it's the shiny new thing. It's very bandwagon-y, even among PhD students.
> There's no 'rigorous comparison' that puts CNNs over ViTs

That’s not accurate. My team wrote a paper for school in which a ResNet model outperformed a ViT model of the same size on almost all metrics. These were smaller models, but depending on the use case that might be what you want.

Don't know if it's you (did you publish?). I read about something similar, but it had its issues:

- Tuning hyperparameters to gain improvement on a dataset when you're constantly looking at the answers is pretty meaningless. It's basically testing on the training data.

- Eval on ImageNet-1k alone (very small, useless for the real world) made me wonder if it wasn't just overfit to the training set. Would it perform better when trained on the datasets used for the foundation models? I doubt it.
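A rough sketch of the usual way around the first point: carve a validation split out of the training data and choose hyperparameters on that, so the test set is only touched once at the end. `split_indices` and `select_config` are hypothetical helpers, and the scoring function is a stand-in rather than a real training run.

```python
import random


def split_indices(n_examples, val_fraction=0.1, seed=0):
    """Shuffle example indices and return (train_indices, val_indices)."""
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)
    n_val = int(n_examples * val_fraction)
    return indices[n_val:], indices[:n_val]


def select_config(configs, score_on_validation):
    """Pick the config with the best validation score; the test set is never consulted."""
    return max(configs, key=score_on_validation)


train_idx, val_idx = split_indices(1000, val_fraction=0.1, seed=0)

# Stand-in scorer (an assumption): pretends a learning rate of 0.01 validates best.
# A real scorer would train on train_idx and evaluate on val_idx.
best_lr = select_config([0.1, 0.01, 0.001], lambda lr: -abs(lr - 0.01))
```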

Well I'm not saying CNNs are bad or useless at any rate.

There's no 'rigorous comparison' that puts CNNs over ViTs in quality, and ViTs unlocked more use cases more easily than CNNs did. That's why they're more popular, not because it's 'bandwagon-y'.
What use case does it enable, versus running a ConvNeXt or EfficientNetV2 and using the resulting strided features as you would the resulting tokens of a ViT? I'm not saying that ViT is worse. Just saying that the scholarship around comparing them is very bad or nonexistent. You have to properly tune the hyperparameters on both sides in a fair way, and use all the general modern training tricks on the CNN side as well to make it fair.
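To make "fair" concrete, here is one possible sketch of an equal-budget comparison: every architecture gets the same search space, the same number of trials, and the same sampled configs. Everything here is an assumption for illustration, including the `SEARCH_SPACE` values, the model names, and the `train_and_validate` callback, which is stubbed out rather than doing real training.

```python
import random

# Illustrative search space (assumed values), shared by every architecture.
SEARCH_SPACE = {
    "lr": [1e-4, 3e-4, 1e-3, 3e-3],
    "weight_decay": [0.0, 0.05, 0.1],
    "mixup": [False, True],
}


def sample_config(rng):
    """Draw one config uniformly at random from the shared search space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def tune(model_name, train_and_validate, n_trials, seed):
    """Random search for one model; returns (best_config, best_validation_score)."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = train_and_validate(model_name, config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def compare(model_names, train_and_validate, n_trials=20, seed=0):
    """Give every model the same trial budget and the same sequence of sampled configs."""
    return {name: tune(name, train_and_validate, n_trials, seed) for name in model_names}


# Stub in place of real training: counts calls and returns a made-up score.
calls = {"convnext": 0, "vit": 0}


def stub_train_and_validate(model_name, config):
    calls[model_name] += 1
    return -abs(config["lr"] - 1e-3)


results = compare(["convnext", "vit"], stub_train_and_validate, n_trials=20)
```

Reusing the seed means both models see identical configs, so neither side benefits from a luckier search.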