| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by mountainriver 756 days ago
	Vision language models use various encoders to project the image into tokens. This is just a means of a unified encoder across modalities