by exe34
496 days ago
> CNN will beat ViT on small data tasks, but that flips with enough scale because ViT imposes less inductive bias

Any idea why this is the case? CNNs have the built-in bias that neighbouring pixels are relevant to each other, simply because they are neighbours. ViTs have to learn this from scratch. So why do they end up doing better than CNNs?
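To make the "inductive bias" part concrete, here is a rough sketch in plain Python (not taken from any library, and all names are made up for illustration). It writes a 1D convolution with zero padding as an ordinary n x n matrix, so you can see how few free parameters it has compared with an unconstrained mixing matrix over the same positions. It only illustrates the locality and weight-sharing constraint; it says nothing about why the less constrained model wins at scale.

```python
def conv1d_as_matrix(kernel, n):
    """Build the dense n x n matrix equivalent to a zero-padded,
    same-length 1D convolution (cross-correlation) with `kernel`.
    Hypothetical helper, for illustration only."""
    half = len(kernel) // 2
    rows = []
    for i in range(n):
        row = [0.0] * n
        for j, w in enumerate(kernel):
            col = i + j - half
            # Entries that would fall outside the input are dropped (zero padding).
            if 0 <= col < n:
                row[col] = w
        rows.append(row)
    return rows


def count_nonzero(matrix):
    """Count entries that are allowed to be non-zero."""
    return sum(1 for row in matrix for v in row if v != 0.0)


n = 8
kernel = [0.25, 0.5, 0.25]
conv = conv1d_as_matrix(kernel, n)

# Locality: each output only sees its immediate neighbours (banded matrix).
# Weight sharing: every row reuses the same 3 numbers.
conv_free_params = len(kernel)   # 3 learnable numbers
dense_free_params = n * n        # 64 if any position may mix with any other
```

The convolution matrix is banded and repeats the same three weights on every row, so the "neighbours matter" assumption is baked into the structure. An unconstrained matrix over the same 8 positions has to discover that pattern from data, which is roughly the position a ViT starts from.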