by JaRail 2412 days ago
I'm not really sure where you're getting this. It doesn't pick a specific voice from a database to use.

From their introduction: "Our approach is to decouple speaker modeling from speech synthesis by independently training a speaker-discriminative embedding network that captures the space of speaker characteristics and training a high quality TTS model on a smaller dataset conditioned on the representation learned by the first network."

Section 2 of the paper explains how it works. Two Minute Papers also goes through it if you'd prefer a video. Link: https://youtu.be/0sR1rU3gLzQ
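To make the "decoupled" part of that quote concrete, here is a rough sketch of how a fixed-size speaker vector could condition a synthesizer. Everything in it is an assumption for illustration: the function names are made up, the numbers are toy stand-ins for real network outputs, and concatenating the vector onto every text-encoder timestep is one common way to do the conditioning, not necessarily exactly what the paper's code does.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def speaker_embedding(utterance_embeddings):
    """Average per-utterance embeddings into one fixed-size speaker vector."""
    dim = len(utterance_embeddings[0])
    normalized = [l2_normalize(e) for e in utterance_embeddings]
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)

def condition_on_speaker(text_encoder_states, spk):
    """Append the same speaker vector to every text-encoder timestep."""
    return [state + spk for state in text_encoder_states]

# Toy stand-ins: a trained speaker encoder would produce these from audio.
reference_utterances = [[3.0, 4.0], [4.0, 3.0]]
spk = speaker_embedding(reference_utterances)

# Toy text-encoder output: 2 timesteps with 3 features each.
text_states = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
conditioned = condition_on_speaker(text_states, spk)
```

The point is that the speaker vector is computed from reference audio alone, so nothing here looks up a stored voice. A new speaker just means a new vector fed into the same synthesizer.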

1 comment

They’re saying that underrepresented voices will have trouble being modeled. That matches my experience with this project: for example, I had a very tough time cloning female voices compared to nerdy-sounding / deep male voices.
It's more that the recordings didn't cover the full range of possible speech sounds, so the model had to guess how the speaker would produce the ones it never heard. All you really need is a paragraph for the person to read that gives sufficient coverage, or simply enough recordings that coverage stops being an issue.
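A minimal sketch of what checking that coverage could look like, assuming you have transcripts of the recordings and some pronunciation lexicon. The tiny lexicon, the required-sound set, and the `missing_units` helper below are all hypothetical and hand-written for the example. A real check would use a full pronunciation dictionary and the complete sound inventory of the language.

```python
def missing_units(required, transcripts, lexicon):
    """Return the required sound units that no transcript covers, sorted."""
    covered = set()
    for text in transcripts:
        for word in text.lower().split():
            # Words missing from the lexicon contribute no coverage.
            covered.update(lexicon.get(word, ()))
    return sorted(set(required) - covered)

# Hypothetical toy lexicon with ARPAbet-style symbols.
LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
    "she": ["SH", "IY"],
}

# Hypothetical target inventory: the sounds we want at least one example of.
REQUIRED = {"DH", "AH", "K", "AE", "T", "S", "SH", "IY"}

recordings = ["The cat sat"]
gaps = missing_units(REQUIRED, recordings, LEXICON)

# Adding one more sentence that contains the missing sounds closes the gap.
gaps_after = missing_units(REQUIRED, recordings + ["she sat"], LEXICON)
```

Here `gaps` lists the sounds the first recording never exercises, and `gaps_after` comes back empty once the extra sentence is added. That is the idea behind having the person read a paragraph chosen for coverage.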