I've been messing with the open source side of audio generation, and expressiveness still takes work but it's getting there. Roughly summarized, my findings are:

- zero-shot voice cloning isn't there yet
- gpt-sovits is the best at non-word vocalizations, but the overall quality is bad with zero-shot alone; finetuning helps
- F5 and fish-speech are both good as well
- xtts has had the best stability for me (I can rely on it not to hallucinate too much; with the others I have to cherry-pick more to get good outputs)
- finetuning an xtts model for a few epochs on a particular speaker does wonders; if you have a good utterance library w/ emotions, conditioning a finetuned xtts model with that speaker expressing a particular emotion yields something very usable
- you can do speech-to-speech on the final output of xtts to get something that (anecdotally) fools most of the people I've tried it on
- non-finetuned XTTS zero-shot -> seed-vc generates something that's okay too, especially if your conditioning audio is really solid
- really creepy voice clones of arbitrary people, indistinguishable at a casual listen, are possible with as little as 30 minutes of speech; the resulting quality captures mannerisms and pacing eerily well, and it's easy to get clean input data from youtube videos/podcasts using de-noising/vocal extraction neural nets

TL;DR: use XTTS and pipe it into seed-vc. The e2e on that pipeline on my machine is something like 2x realtime and generates highly controllable, natural-sounding voices. You have to manually condition emotive speech.
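For anyone who wants to try the XTTS -> seed-vc chain, here is a rough sketch of how it could be wired up, not my exact setup. It assumes Coqui TTS is installed (`pip install TTS`) and that you have a local checkout of seed-vc. The `inference.py` script name and the `--source`, `--target`, `--output`, and `--diffusion-steps` flags are written from my reading of the seed-vc README and should be checked against your checkout. The helper names (`synthesize_with_xtts`, `seed_vc_command`, `run_pipeline`) are made up for illustration.

```python
import subprocess
import sys
from pathlib import Path


def synthesize_with_xtts(text, speaker_wav, out_wav, language="en"):
    """Zero-shot XTTS synthesis conditioned on a reference clip."""
    # Imported lazily so the other helpers work without Coqui TTS installed.
    from TTS.api import TTS

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(
        text=text,
        speaker_wav=speaker_wav,
        language=language,
        file_path=out_wav,
    )
    return out_wav


def seed_vc_command(source_wav, target_wav, output_dir, diffusion_steps=25):
    """Build the seed-vc call (ASSUMPTION: flag names as in the seed-vc README)."""
    return [
        sys.executable, "inference.py",
        "--source", str(source_wav),      # audio whose content is kept
        "--target", str(target_wav),      # reference voice to convert to
        "--output", str(output_dir),      # directory for the converted audio
        "--diffusion-steps", str(diffusion_steps),
    ]


def run_pipeline(text, reference_wav, work_dir, seed_vc_repo):
    """XTTS zero-shot first, then voice conversion toward the same reference."""
    work = Path(work_dir).resolve()
    work.mkdir(parents=True, exist_ok=True)
    reference = Path(reference_wav).resolve()

    xtts_wav = work / "xtts.wav"
    synthesize_with_xtts(text, str(reference), str(xtts_wav))

    # Absolute paths are used because the command runs inside the seed-vc repo.
    cmd = seed_vc_command(xtts_wav, reference, work / "converted")
    subprocess.run(cmd, cwd=seed_vc_repo, check=True)
    return work / "converted"
```

Something like `run_pipeline("Hello there.", "speaker.wav", "out", "path/to/seed-vc")` would run both stages (both paths are placeholders). To get emotive speech, swap the reference clip for one where the speaker expresses the emotion you want.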
At the very least, wouldn't you have to provide 1 sample? Which would make it "few shot" (if that term really even makes sense in this context).