|
|
|
|
|
by Riccardo_G
2412 days ago
|
|
What it is doing is not really cloning, but because it was trained on 18k different voices, it actually finds one that is closest to yours, and uses that one. It can do a bit of interpolation to create an embedding which is closer to your own, but only if it is well represented by a mix of other voices. Real voice cloning like at https://replicastudios.com/ can take just a minute or two of audio, and it does a fairly good job, and it is always improving. With more audio you start being able to also play with emotion and styles, which is very cool! |
|
From their introduction: "Our approach is to decouple speaker modeling from speech synthesis by independently training a speaker-discriminative embedding network that captures the space of speaker characteristics and training a high quality TTS model on a smaller dataset conditioned on the representation learned by the first network."
Section 2 of the paper explains how it works. Two minute papers also goes through it if you'd prefer a video. Link: https://youtu.be/0sR1rU3gLzQ