Hacker News
by mazoza 596 days ago
I don't actually see any tokens used in the model. It seems like the model predicts latents directly, and then a VAE converts them back to audio. That makes it more like Tortoise or XTTS.