by Riccardo_G 2412 days ago
What it does is not really cloning. Because it was trained on 18k different voices, it finds the one closest to yours and uses that. It can interpolate a bit to create an embedding closer to your own voice, but only if your voice is well represented by a mix of the training voices. Real voice cloning, like at https://replicastudios.com/, can take just a minute or two of audio and does a fairly good job, and it keeps improving. With more audio you can also start playing with emotion and styles, which is very cool!
2 comments

I'm not really sure where you're getting this. It doesn't pick a specific voice from a database to use.

From their introduction: "Our approach is to decouple speaker modeling from speech synthesis by independently training a speaker-discriminative embedding network that captures the space of speaker characteristics and training a high quality TTS model on a smaller dataset conditioned on the representation learned by the first network."

Section 2 of the paper explains how it works. Two Minute Papers also goes through it if you'd prefer a video: https://youtu.be/0sR1rU3gLzQ
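
To make the distinction concrete, here is a toy sketch (not the paper's code) contrasting the two readings: looking up the closest stored voice versus computing a point in a continuous embedding space. The 3-dimensional vectors, the voice names, and the helpers `speaker_embedding` and `nearest_training_voice` are all made up for illustration, and averaging normalized per-utterance vectors is an assumption about how a reference embedding might be formed.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def speaker_embedding(utterance_embeddings):
    """Continuous reading: average normalized per-utterance vectors, then renormalize."""
    normed = [l2_normalize(u) for u in utterance_embeddings]
    dim = len(normed[0])
    mean = [sum(u[i] for u in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)

def nearest_training_voice(target, training_voices):
    """Lookup reading: return the name of the most similar stored voice."""
    return max(training_voices, key=lambda name: cosine(target, training_voices[name]))

# Hypothetical embeddings; a real encoder would output far more dimensions.
training_voices = {"voice_a": [1.0, 0.0, 0.0], "voice_b": [0.0, 1.0, 0.0]}
new_speaker_utterances = [[0.6, 0.8, 0.1], [0.5, 0.7, 0.3]]

target = speaker_embedding(new_speaker_utterances)
closest = nearest_training_voice(target, training_voices)

# The synthesizer would be conditioned on `target` itself, which is a new point
# in the space, not on the stored vector for `closest`.
print(closest, round(cosine(target, training_voices[closest]), 3))
```

In this toy setup the new speaker's embedding is similar to one stored voice but not identical to it, which is the difference between conditioning on a learned representation and picking an entry from a database.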

They’re saying that underrepresented voices will have trouble being modeled. That matches my experience with this project: for example, I had a very tough time cloning female voices compared to nerdy-sounding / deep male voices.
It's more that the sounds produced during the recordings didn't cover the entire spectrum of possible sounds, so the model had to estimate the ones that were missing. All you really need is a paragraph for the person to read that gives sufficient coverage, or simply enough recordings that it's no longer an issue.
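
A rough way to check that kind of coverage, as a sketch only: count which sound units from an inventory actually show up in the recordings. The inventory below is a tiny hypothetical set of phoneme labels, and the recording is assumed to be already converted to those labels; neither comes from the project being discussed.

```python
def coverage(recorded_units, inventory):
    """Return the fraction of inventory units that were recorded, plus the missing ones."""
    seen = set(recorded_units) & set(inventory)
    missing = sorted(set(inventory) - seen)
    return len(seen) / len(inventory), missing

# Hypothetical inventory and an already-phonemized recording.
inventory = ["AA", "IY", "UW", "S", "SH", "T", "K", "M"]
recording = ["S", "IY", "T", "AA", "M", "S", "IY"]

ratio, missing = coverage(recording, inventory)
print(ratio, missing)
```

A reading paragraph would then be chosen so that `missing` comes back empty.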
Was it 18k voices or samples? Also, is it finding the closest voice, or is it a continuous parameter space formed from the voices?