Hacker News
by wczekalski 950 days ago
One thing I've seen done for style cloning is a high-quality, fine-tuned TTS -> RVC pipeline to "enhance" the output: TTS for intonation and pronunciation, RVC for voice texture. With StyleTTS and this pipeline you should get close to ElevenLabs.
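A rough sketch of how the two stages chain together, in case it helps. Everything here is a placeholder: `Audio`, `clone_style`, `fake_tts`, and `fake_rvc` are hypothetical names, not the actual StyleTTS or RVC APIs. The point is only that the TTS stage fixes timing and pronunciation, and the conversion stage changes the timbre on top of that.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Audio:
    samples: List[float]
    sample_rate: int


# Hypothetical stage signatures; neither matches a real StyleTTS or RVC API.
TTSStage = Callable[[str], Audio]
VCStage = Callable[[Audio], Audio]


def clone_style(text: str, tts: TTSStage, rvc: VCStage) -> Audio:
    """Run TTS first (intonation, pronunciation), then voice conversion (texture)."""
    draft = tts(text)   # stage 1: text -> speech in the TTS model's own voice
    return rvc(draft)   # stage 2: same speech, re-voiced with the target timbre


# Stand-ins so the sketch runs without any models installed.
def fake_tts(text: str) -> Audio:
    # One silent sample per character, just to have data flowing through.
    return Audio([0.0] * len(text), 22050)


def fake_rvc(audio: Audio) -> Audio:
    # A real converter would change timbre; this stand-in just offsets each
    # sample while keeping the length and sample rate unchanged.
    return Audio([s + 0.1 for s in audio.samples], audio.sample_rate)


out = clone_style("hello", fake_tts, fake_rvc)
```

In a real setup you would swap the two stand-ins for wrappers around whatever fine-tuned TTS and RVC models you actually have.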
2 comments

I suspect they are doing many more things to make it sound better. I certainly hope open source solutions can approach that level of quality, but so far I've been very disappointed.
RVC? R… Voice Model?
Retrieval-based voice conversion, apparently.