by mahmoudfelfel 1343 days ago
The original model (https://play.ht/blog/introducing-truly-realistic-text-to-spe...) was trained on 50k hours of audio; the voices above were just fine-tuned on that model, with only 4-6 hours of audio each.

We recently fine-tuned another voice with only 1 hour, though. I think eventually (soon) we will only need 15-20 minutes, done zero-shot with no fine-tuning at all.