| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by Legend2440 384 days ago
	It used to be done that way, but newer multimodal LLMs train on a mix of image and text tokens, so they don’t need a separate image encoder. There is just one model that handles everything.