by menaerus
178 days ago
Using transformers doesn't rule out the other tools up your sleeve. What about DINOv2 and DINOv3, the 1B and 7B vision transformer models? This paper [1] suggests significant improvements over traditional YOLO-based object detection.
[1] https://arxiv.org/html/2509.20787v2
|
IMO there is little reason to think transformers are (even today) the best architecture for any deep learning application. Perhaps if a mega-corp poured all its resources into some convolutional transformer architecture, you'd get something better than the current vision transformer (ViT) models. But since so much optimization and training work has already gone into ViTs, and since we clearly still haven't maxed out their capacity, it makes sense to stick with them at scale.
That being said, ViTs are still clearly the best choice if you want something trained on a near-entire-internet of image or video data.
[1] https://arxiv.org/abs/2103.15808
[2] https://scholar.google.ca/scholar?hl=en&as_sdt=0%2C5&q=convo...