by vvolhejn 244 days ago
There's a great blog post from Sander Dieleman about exactly this: why do we need a two-step pipeline, in particular for images and audio? https://sander.ai/2025/04/15/latents.html

For text, there are a few papers that train the tokenization and the language model end-to-end, see for example: https://arxiv.org/abs/2305.07185