by z991 902 days ago
I commend the authors on making this easy to try! However, it doesn't work very well for me for general voice cloning. I read the first paragraph of the Wikipedia page on books and had it generate the next sentence. It's obviously computer generated to my ear.

Audio sample: https://storage.googleapis.com/dalle-party/sample.mp3

Cloned voice (converted to mp3): https://storage.googleapis.com/dalle-party/output_en_default...

All I did was install the packages with pip and then run "demo_part1.ipynb" with my audio sample plugged in. It ran almost instantly on my laptop (3070 Ti, 8GB). (Also, I admit to not reading the paper; I just ran the code.)

> It's obviously computer generated to my ear.

From the README

    Disclaimer

    This is an open-source implementation that approximates the performance of the internal voice clone technology of myshell.ai. The online version in myshell.ai has better 1) audio quality, 2) voice cloning similarity, 3) speech naturalness and 4) computational efficiency.
So this paper is a thinly veiled ad for myshell.ai's services?
Yes. And I used myshell.ai out of interest. It’s also absolutely terrible.
I went through downloading the open source version yesterday and tried it with my voice in the microphone, and a few other saved wav files.

It was terrible. Absolutely terrible. Like, given how much hype I saw about this, I expected something half decent. It was not. It was bad, so bad bad bad.

I was thinking maybe I did something wrong, but then I watched some of the YouTube reviews. These guys were SO excited at the start of the video, and then at the end they all literally said, "Uh, well, you be the judge."

I still can't help but feel there's some kind of trick to it: get the right input sample, done in the right intonation, and maybe you can generate anything.

I came here just for your comment. Thank you for doing this work so the rest of us don't have to.
Like 50% of arXiv. SV figured out that people read papers in 202x, not PRNewsWire, and has adjusted accordingly.
Not totally unexpected unfortunately. Any other OSS players on the market?
RVC
Thanks for the real example. Sounded quite generated to my ear as well. Wonder if it can do any better with more source material.
Looking at the website and the examples, it's pretty clearly set up to make stylized anime voices.
This is the driver for a lot of things. Anime. x264 was to enable better compression of weeb videos. This tech will allow fan dubs to better represent the anime in the videos.
Anime also drove the development of a lot of subtitling technology if I remember correctly.
My experience with other tools like xtts is you really need to have a studio-quality voice sample to get the best results.
The most obvious problem to my ears is the syllable timing and inflection of the generated speech, and, intuitively, this doesn’t seem like a recording quality issue. It’s as if it did a mostly credible job of emulating the speaker trying to talk like a robot.
The biggest trip-up is the pronunciation of "prototypically", and you had "typically" in your original. Maybe it's overfitting to a stilted "proto-typically"? You could try a different, less similar sentence.
That might be the next big contribution – performance in perceptually catching the features of a not-so-good recording – for example, with a webcam style microphone.