|
|
|
|
|
by saeranv
1569 days ago
|
|
Are transformers competitive with (for example) CNNs on vision-related tasks when there's less data available? I'm not that familiar with "injecting inductive bias" via positional encodings, but it sounds really interesting. My crude understanding is that the positional encodings were used in the original Transformer architecture to encode the ordering of words for NLP. Are they more flexible then that? For example, can they be used to replicate the image-related inductive bias of CNNs and match CNN performance on small datasets (1000 - 10,000)? If not, then to me, it seems like only industries where it's possible to get access to a large amount of representative data (i.e. greater than a million?) benefit from transformers. In industries where there are bottlenecks to data generation, there's a clear benefit in leveraging the inductive bias in other architectures, such as the various ways CNNs have biases towards image recognition. I'm in an industry (building energy consumption prediction) where we can only generate around 10,000 to 100,000 datapoints (from simulation engines) for DL. Are transformers ever used with that scale of data? |
|
They can be, there's current research into the tradeoffs between local inductive bias (information from local receptive fields: CNNs have strong local inductive bias) and global inductive bias (large receptive fields: i.e. attention). There's plenty of works that combine CNNs and Attention/Transformers. A handful of them focus on smaller datasets, but the majority are more interested in ImageNet. There's also work being done to change the receptive fields within attention mechanisms as a means to balance this.
> Are transformers ever used with that scale of data?
So there's a yes and no to your question. But definitely yes since people have done work on Flowers102 (6.5k training) and CIFAR10 (50k training). Keep in mind that not all these models are pure transformers. Some have early convolutions or intermediate ones. Some of these works even have a smaller number of parameters and better computational efficiency than CNNs.
But more importantly, I think the big question is about what type of data you have. If large receptive fields are helpful to your problem then transformers will work great. If you need local receptive fields then CNNs will tend to do better (or combinations of transformers and CNNs or reduced receptive fields on transformers). I doubt there will be a one size fits all architecture.
One thing to also keep in mind is that transformers typically like heavy amounts of augmentation. Not all data can be augmented significantly. There's also pre-training and knowledge transfer/distillation.