by fotcorn
950 days ago
The recently released XTTS-v2 model[0] from coqui.ai comes very close to what ElevenLabs[1] can do. It runs reasonably fast on a recent GPU and should also work on CPU. It requires only a 3-second (!) clip of the voice you want to clone. The license does not allow commercial use.

0: https://huggingface.co/coqui/XTTS-v2
1: https://elevenlabs.io/
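For anyone who wants to try it, here is a minimal sketch of cloning a voice from a short reference clip. It assumes the Coqui `TTS` Python package and PyTorch are installed, and uses the model name and `tts_to_file` call as shown on the XTTS-v2 model card. The `clone_voice` helper and any file paths passed to it are hypothetical names chosen for illustration.

```python
from pathlib import Path

# Model identifier as listed on the XTTS-v2 model card.
MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def clone_voice(text, speaker_wav, out_path, language="en"):
    """Synthesize `text` in the voice of the short clip at `speaker_wav`.

    Hypothetical helper for illustration; returns the output path.
    """
    reference = Path(speaker_wav)
    # Check the reference clip first so a bad path fails fast,
    # before the (large) model is downloaded and loaded.
    if not reference.is_file():
        raise FileNotFoundError(f"reference clip not found: {reference}")

    # Imported lazily: these are heavy third-party dependencies.
    import torch
    from TTS.api import TTS

    # Use the GPU when one is available, otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS(MODEL_NAME).to(device)

    tts.tts_to_file(
        text=text,
        speaker_wav=str(reference),
        language=language,
        file_path=str(out_path),
    )
    return Path(out_path)
```

A call would look like `clone_voice("Hello there.", "reference.wav", "output.wav")`, where `reference.wav` is your own recording of the target voice. The first run downloads the model weights, so expect a delay.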
|
Sure, if you want a guaranteed uncanny valley experience. There is no way a few seconds are enough to cover all the ways a specific person pronounces things. A person's voice is much more than just the pitch, and with a 3-second sample anyone who knows them will be able to tell something's off within 3 seconds.