Hacker News new | ask | show | jobs
by WarmWash 77 days ago
Even multimodal models are still really bad when it comes to vision. The strength is still definitely language.
1 comments

Training for tasks still works petty well, but “vision” is a super broad domain and most seem optimized for OCR and screen processing (which have verifiable outputs and relatively straightforward data generation)