This is HN, so I'm surprised that no one in the comments section has run this locally. :) Following the instructions in their repo (and moving the checkpoints/ and resources/ folders into the "nested" openvoice subfolder), I managed to get the Gradio demo running. Simple enough. It appears to be quicker than XTTS2 on my machine (RTX 3090), and uses approximately 1.5GB of VRAM. The Gradio demo is limited to 200 characters, perhaps over resource usage concerns, but it seems to run at around 8x realtime (8 seconds of speech for about 1 second of processing time). EDIT: patched the Gradio demo for longer text; it's way faster than that. One minute of speech only took ~4 seconds to render (roughly 15x realtime). Default voice sample, reading this very comment: https://voca.ro/18JIHDs4vI1v
I had to write out acronyms -- XTTS2 as "ex tee tee ess two", for example. The voice clarity is better than XTTS2, too, but the speech can sound a bit stilted and, well, robotic/TTS-esque compared to it. The cloning consistency is definitely a step above XTTS2 in my experience -- XTTS2 would sometimes have random pitch shifts or plosives/babble in the middle of speech.
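If you'd rather not write acronyms out by hand, something like the sketch below could do it before the text goes to the TTS. This is only an illustration, not part of OpenVoice: the function names, the regex, and the letter spellings are all my own assumptions, and you'd likely want to tune the spellings by ear.

```python
import re

# Hypothetical spoken forms for letters and digits; tune these by ear.
SPOKEN = {
    "A": "ay", "B": "bee", "C": "see", "D": "dee", "E": "ee", "F": "eff",
    "G": "gee", "H": "aitch", "I": "eye", "J": "jay", "K": "kay", "L": "el",
    "M": "em", "N": "en", "O": "oh", "P": "pee", "Q": "cue", "R": "ar",
    "S": "ess", "T": "tee", "U": "you", "V": "vee", "W": "double you",
    "X": "ex", "Y": "why", "Z": "zee",
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

# A token counts as an acronym here if it starts with a capital letter and
# continues with at least one more capital letter or digit (e.g. XTTS2, GPU).
ACRONYM = re.compile(r"\b[A-Z][A-Z0-9]+\b")


def spell_out(token: str) -> str:
    """Turn 'XTTS2' into 'ex tee tee ess two'."""
    return " ".join(SPOKEN[ch] for ch in token)


def expand_acronyms(text: str) -> str:
    """Replace every acronym-looking token with its spelled-out form."""
    return ACRONYM.sub(lambda m: spell_out(m.group(0)), text)


if __name__ == "__main__":
    print(expand_acronyms("XTTS2 would sometimes have random pitch shifts."))
```

One caveat: the pattern also catches ordinary all-caps words such as "EDIT", so a real version would probably need an allow-list or a skip-list.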
I was able to run the demos all right, but when trying to use another reference speaker (in demo_part1), the result doesn't sound at all like the source (it's just a random male voice).
I'm also trying to produce French output, using a reference audio file in French for the base speaker, and a text in French. This triggers an error in api.py line 75 that the source language is not accepted.
Indeed, in api.py line 45 the only two source languages allowed are English and Chinese; simply adding French to language_marks in api.py line 43 avoids errors but produces a weird/unintelligible result with a super heavy English accent and pronunciation.
I guess one would need to generate source_se again, and probably mess with config.json and checkpoint.pth as well, but I could not find instructions on how to do this...?
Edit -- tried again on https://app.myshell.ai/ -- the result sounds French alright, but still nothing like the original reference. It would be absolutely impossible to confuse one with the other, even for someone who didn't know the person very well.
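On the language_marks point, here is a rough sketch of why adding an entry can silence the error without changing how the output sounds. It is not copied from api.py; `LANGUAGE_MARKS`, `tag_text`, and the `[EN]...[EN]` tag format are hypothetical stand-ins for that kind of check. The idea is that the mark only validates the language name and tags the text, while pronunciation presumably still comes from whatever the base checkpoint was trained on.

```python
# Hypothetical stand-in for a language table like the one described above.
LANGUAGE_MARKS = {"english": "EN", "chinese": "ZH"}


def tag_text(text: str, language: str, marks: dict = LANGUAGE_MARKS) -> str:
    """Validate the language name and wrap the text in its mark."""
    mark = marks.get(language.lower())
    if mark is None:
        # The kind of failure you hit when the language is not in the table.
        raise ValueError(f"language {language!r} is not supported")
    return f"[{mark}]{text}[{mark}]"


if __name__ == "__main__":
    try:
        tag_text("Bonjour", "French")
    except ValueError as err:
        print("before:", err)

    # Adding an entry makes the check pass, but nothing here teaches the
    # model French phonemes; the text is only tagged differently.
    patched = {**LANGUAGE_MARKS, "french": "FR"}
    print("after:", tag_text("Bonjour", "French", patched))
```

If that is roughly what happens, it would fit the guess above that French needs a base speaker checkpoint (and matching config) that was actually trained on French.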