| HN Mirror

	by crypto420 2 days ago
	CNNs excel in vision tasks where you have limited compute, limited memory, and limited data, and want something that works well and runs quickly. People usually don't hook CNNs up to a transformer to get language understanding either; you have to train bespoke CNNs for specific tasks. ViTs excel where you're unbounded in compute and data and also want text understanding, or want to have a conversation about an image.

1 comment

bonoboTP 2 days ago

These are vibes. ViT has been shown to work fine on small data with proper hyperparameters, and most of what you mention is doable just fine with the other architecture as well.
