by thot_experiment 541 days ago

I've been messing with the open source side of audio generation. Expressiveness still takes work, but it's getting there. Roughly summarized, my findings are:

- zero shot voice cloning isn't there yet

- gpt-sovits is the best at non-word vocalizations, but the overall quality is bad when just using zero shot; finetuning helps

- F5 and fish-speech are both good as well

- xtts has had the best stability for me (i can rely on it not to hallucinate too much; with the others i have to cherrypick more to get good outputs)

- finetuning an xtts model for a few epochs on a particular speaker does wonders; if you have a good utterance library w/ emotions, conditioning the finetuned model on a clip of that speaker expressing a particular emotion yields something very usable

- you can do speech to speech on the final output of xtts to get to something that (anecdotally) fools most of the people i've tried it on

- non finetuned XTTS zero shot -> seed-vc generates something that's okay also, especially if your conditioning audio is really solid

- really creepy voice clones of arbitrary people, indistinguishable at a casual listen, are possible with as little as 30 minutes of speech; the resulting quality captures mannerisms and pacing eerily well, and it's easy to get clean input data from youtube videos/podcasts using de-noising/vocal extraction neural nets

TL;DR: use XTTS and pipe it into seed-vc. The e2e on that pipeline on my machine is something like 2x realtime, and it generates highly controllable, natural sounding voices, though you have to manually condition emotive speech.
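
For anyone who wants to time the same kind of two-stage pipeline, here is a minimal sketch. It is not my actual setup: `synthesize` and `convert` are hypothetical callables you would write yourself around XTTS and seed-vc, and `run_pipeline` and `realtime_factor` are names made up for this example. It only uses the Python standard library, and it defines the ratio as processing time divided by audio length, which is one assumed reading of "2x realtime".

```python
import time
import wave
from pathlib import Path
from typing import Callable, Tuple

# Hypothetical stage signature: (input, reference_wav, output_wav) -> None
Stage = Callable[..., None]


def wav_seconds(path: Path) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def run_pipeline(text: str, reference_wav: Path, out_wav: Path,
                 synthesize: Stage, convert: Stage) -> Tuple[float, float]:
    """Run TTS, then voice conversion; return (audio_seconds, wall_seconds)."""
    tts_wav = out_wav.with_name(out_wav.stem + "_tts.wav")
    start = time.perf_counter()
    # Stage 1 (assumed wrapper): text + reference clip -> intermediate WAV
    synthesize(text, reference_wav, tts_wav)
    # Stage 2 (assumed wrapper): intermediate WAV + reference clip -> final WAV
    convert(tts_wav, reference_wav, out_wav)
    wall_seconds = time.perf_counter() - start
    return wav_seconds(out_wav), wall_seconds


def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Processing time per second of generated audio (assumed definition)."""
    return wall_seconds / audio_seconds
```

Under this definition, a value of 2.0 means 20 seconds of compute for 10 seconds of audio; flip the ratio if you prefer to report speed relative to realtime.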

2 comments

cjonas 541 days ago

What scenario would be considered "zero shot" voice cloning?

At the very least, wouldn't you have to provide 1 sample? Which would make it "few shot" (if that term really even makes sense in this context).

IanCal 541 days ago

I think the key distinction is that there is no specific training data for that speaker. You can view the input as just the input voice to clone, not training examples.

It would be more like training examples if you had to give it specific phrases.
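
To make that concrete, here is a rough sketch of what "zero shot" looks like in code: the reference clip is just another inference-time argument, and no weights are updated. The helper name `clone_zero_shot` is made up for this example. It assumes `tts` is an object exposing `tts_to_file(text=..., speaker_wav=..., language=..., file_path=...)`, which is how I understand the Coqui TTS Python API to work for XTTS; check the docs for your installed version.

```python
def clone_zero_shot(tts, text, reference_wav, out_path, language="en"):
    """Synthesize `text` in the voice of `reference_wav` with no training step.

    `tts` is assumed to be a loaded model object, for example (assumption):
        from TTS.api import TTS
        tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    """
    # The reference clip conditions the model at inference time only;
    # nothing about the speaker is learned or stored in the weights.
    tts.tts_to_file(
        text=text,
        speaker_wav=reference_wav,
        language=language,
        file_path=out_path,
    )
    return out_path
```

Finetuning, by contrast, would mean running a training loop on that speaker's recordings before ever calling something like this.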

ekianjo 541 days ago

XTTS is non-commercial use only, though.

thot_experiment 541 days ago

I think XTTS is MPL now since Coqui folded, but I am not a lawyer and I'm not using this for anything commercial, so I haven't looked closely.
