by mazoza 596 days ago
I don't actually see any tokens used in the model. It seems like the model predicts latents directly, and a VAE then converts them back to audio. That makes it more like Tortoise or XTTS.