| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by WarmWash 77 days ago
	Even multimodal models are still really bad when it comes to vision. The strength is still definitely language.

1 comments

nostrebored 76 days ago

Training for tasks still works petty well, but “vision” is a super broad domain and most seem optimized for OCR and screen processing (which have verifiable outputs and relatively straightforward data generation)

link