There's a great blog post from Sander Dieleman about exactly this: why do we need a two-step pipeline, in particular for images and audio?
https://sander.ai/2025/04/15/latents.html
For text, there are a few papers that train the tokenization and the language model end-to-end; see, for example: https://arxiv.org/abs/2305.07185