by crypto420 2 days ago
CNNs excel in vision tasks where you have limited compute, limited memory, limited data, and want something that works really well and runs fast. People usually don't hook CNNs up to a transformer to get language understanding either; you have to train bespoke CNNs for specific tasks.

ViTs excel where you're unbounded in compute and data, and also want text understanding or want to have a conversation about an image.

1 comment

These are vibes. ViTs have been shown to work fine on small data with proper hyperparameter tuning, and most of what you mention is doable just fine with the other architecture as well.