Y
Hacker News
new
|
ask
|
show
|
jobs
by
WarmWash
77 days ago
Even multimodal models are still really bad when it comes to vision. The strength is still definitely language.
1 comments
nostrebored
76 days ago
Training for tasks still works petty well, but “vision” is a super broad domain and most seem optimized for OCR and screen processing (which have verifiable outputs and relatively straightforward data generation)
link