	by vvolhejn 244 days ago
	There's a great blog post from Sander Dieleman about exactly this: why do we need a two-step pipeline, in particular for images and audio? https://sander.ai/2025/04/15/latents.html For text, there are a few papers that train the tokenization and the language model end-to-end; see, for example, https://arxiv.org/abs/2305.07185