| HN Mirror



	by runeblaze 240 days ago
	Each text token is often a subword unit, but in VLMs the visual tokens live in a semantic space, and a semantic space obviously compresses much more than subword slices. Disclaimer: not an expert, off the top of my head.
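A rough back-of-envelope sketch of that comparison, in case it helps. Everything here is an assumption for illustration and not from any particular model: ViT-style square patches, a hypothetical `merge_factor` for downsampling patch tokens, and the common heuristic of roughly 4 characters per subword token.

```python
def visual_token_count(height, width, patch_size, merge_factor=1):
    """Patch tokens for a ViT-style encoder, after optional token merging.

    merge_factor is a hypothetical knob: how many patch tokens are
    pooled into one visual token before reaching the language model.
    """
    patches = (height // patch_size) * (width // patch_size)
    return patches // merge_factor


def text_token_estimate(num_chars, chars_per_token=4.0):
    """Very rough subword token count (assumes ~4 chars per token)."""
    return round(num_chars / chars_per_token)


# Hypothetical page: 1024x1024 image, 16-pixel patches, 16 patches merged per token.
vis = visual_token_count(1024, 1024, 16, merge_factor=16)
# Assume the same page holds about 4000 characters of text.
txt = text_token_estimate(4000)

print(f"visual tokens: {vis}, text tokens: {txt}, ratio: {txt / vis:.2f}")
```

Under these made-up numbers the image needs fewer tokens than the text it contains, but the ratio depends entirely on the patch size, merge factor, and text density, so treat it as arithmetic rather than evidence.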