


	by lisperforlife 739 days ago
	Why is this not the top comment? FAIR published their CM3leon paper on decoder-only autoregressive models that work with both text and image tokens. I believe GPT-4o's vocabulary has room for both image and audio tokens. For audio tokens, they probably trained an RVQ-VAE model like EnCodec or SoundStream.
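
For anyone unfamiliar with the RVQ part: a rough sketch of residual vector quantization, the idea behind codecs like EnCodec and SoundStream. Each stage quantizes whatever error the previous stage left behind, so one audio frame becomes a short list of codebook indices. Everything here is illustrative. The codebooks are hand-written toy values rather than learned ones, and `to_vocab_ids` with its `offset` is a hypothetical way to map codes into a shared token vocabulary, not something known about GPT-4o.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def to_vocab_ids(codes, codebook_size, offset):
    """Hypothetical mapping: give each stage its own id range after `offset`."""
    return [offset + stage * codebook_size + idx for stage, idx in enumerate(codes)]


# Toy 2-stage setup: a coarse codebook followed by a finer one.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # stage 0: coarse
    [[0.0, 0.0], [0.5, 0.0]],  # stage 1: refines the residual
]
frame = [1.4, 1.0]
codes = rvq_encode(frame, toy_codebooks)
approx = rvq_decode(codes, toy_codebooks)
token_ids = to_vocab_ids(codes, codebook_size=2, offset=100_000)
```

In a real codec the codebooks are trained jointly with an encoder and decoder network, and the vectors being quantized are learned latents rather than raw samples.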