| HN Mirror


	by corimaith 3 days ago
	Or you could just use a CNN...

3 comments

bigmadshoe 3 days ago

CNNs are no longer SoTA when it comes to large models, and they aren't used to produce text interpretations of images; they're used for classification, semantic segmentation, etc.

bonoboTP 2 days ago

CNNs are fine when trained with a good recipe. There are very few good studies comparing them with proper hyperparam search and all the training tricks applied consistently. Transformers are good but ViT vs CNN is not some settled issue. Transformers are more hyped and more popular with the tech enthusiasts who just read forums and news, but if you need stuff done, CNNs are still great.

bigmadshoe 2 days ago

I agree, but since we're talking about image understanding with text output, clearly a CNN is unsuitable. My previous comment was overly reductive, and CNNs can still be SoTA depending on your performance metrics. I spent the earlier part of my career training CNNs, and they are very pleasant to work with.

bonoboTP 2 days ago

You can run a CNN and use the downsampled feature map the same way as patch tokens.
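
[Illustration] A minimal sketch of what "use the feature map the same way as patch tokens" could mean in practice: each spatial position of the CNN's downsampled output becomes one token whose features are the channel values at that position. The function name `feature_map_to_tokens` and the toy data are hypothetical, and the CNN itself is not modeled here; plain nested lists stand in for tensors so the example has no dependencies.

```python
def feature_map_to_tokens(fmap):
    """Flatten a [C][H][W] feature map into H*W tokens of length C (row-major)."""
    channels = len(fmap)
    height = len(fmap[0])
    width = len(fmap[0][0])
    # One token per spatial location; its features are the channel values there.
    return [
        [fmap[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


# Toy 2-channel, 2x2 "feature map" standing in for a CNN's downsampled output.
toy_fmap = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
]
tokens = feature_map_to_tokens(toy_fmap)
print(len(tokens), len(tokens[0]))  # number of tokens, features per token
```

With a tensor library this is usually a single reshape; in PyTorch, for example, a `(B, C, H, W)` tensor `x` becomes `(B, H*W, C)` via `x.flatten(2).transpose(1, 2)`. How the resulting tokens are projected or fed to a language model is not specified in the comment above and is left out here.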

famouswaffles 2 days ago

>Transformers are more hyped and more popular with the tech enthusiasts who just read forums and news, but if you need stuff done, CNNs are still great.

ViTs are straight up more popular in ML research now; it's not just 'tech enthusiasts'.

bonoboTP 2 days ago

There's a dearth of research properly comparing them.

famouswaffles 2 days ago

I'm talking about research pushing the state of the art in computer vision. ViTs have 100% become more popular than CNNs in most CV research.

bonoboTP 2 days ago

Yes, but not based on rigorous comparison. I'm not saying ViT is bad. But it took over mainly because it's the shiny new thing. It's very bandwagon-y, even among PhD students.

tehjoker 3 days ago

Can you say more about that? I haven't kept up.

crypto420 2 days ago

CNNs excel in vision tasks where you have limited compute, limited memory, limited data, and want something that works super well and quickly. People usually don't hook CNNs up to a transformer to get language understanding either; you have to train bespoke CNNs for specific tasks.

ViTs excel where you're unbounded in compute + data and also want text understanding or want to have a conversation about an image.

bonoboTP 2 days ago

These are vibes. ViT has been shown to work fine on small data with proper hyperparameters, and most of what you mention is doable just fine with the other architecture as well.

Jabrov 3 days ago

Transformers are superior

nullstyle 3 days ago

Which?
