
by z991 902 days ago

I commend the authors on making this easy to try! However, it doesn't work very well for me for general voice cloning. I read the first paragraph of the Wikipedia page on books and had it generate the next sentence. It's obviously computer generated to my ear.

Audio sample: https://storage.googleapis.com/dalle-party/sample.mp3

Cloned voice (converted to mp3): https://storage.googleapis.com/dalle-party/output_en_default...

All I did was install the packages with pip and then run "demo_part1.ipynb" with my audio sample plugged in. It ran almost instantly on my laptop 3070 Ti / 8GB. (Also, I admit to not reading the paper; I just ran the code.)

dijksterhuis 901 days ago

> It's obviously computer generated to my ear.

From the README

    Disclaimer

    This is an open-source implementation that approximates the performance of the internal voice clone technology of myshell.ai. The online version in myshell.ai has better 1) audio quality, 2) voice cloning similarity, 3) speech naturalness and 4) computational efficiency.

uoaei 901 days ago

So this paper is a thinly veiled ad for myshell.ai's services?

ametrau 901 days ago

Yes. And I used myshell.ai out of interest. It’s also absolutely terrible.

RossDCurrie 898 days ago

I went through downloading the open source version yesterday and tried it with my voice in the microphone, and a few other saved wav files.

It was terrible. Absolutely terrible. Like, given how much hype I saw about this, I expected something half decent. It was not. It was bad, so bad bad bad.

I was thinking maybe I did something wrong, but then I watched some of the YouTube reviews. These guys were SO excited at the start of the video and then at the end, they all literally said, "Uh, well, you be the judge."

I still can't help but feel there's some kind of trick to it: get the right input sample, done in the right intonation, and maybe you can generate anything.

dvfjsdhgfv 901 days ago

I came here just for your comment. Thank you for doing this work so the rest of us don't have to.

gmerc 901 days ago

Like 50% of arXiv. SV figured out that people read papers in 202x, not PRNewsWire, and have adjusted accordingly.

3abiton 901 days ago

Not totally unexpected, unfortunately. Any other OSS players on the market?

cchance 901 days ago

RVC

fbdab103 902 days ago

Thanks for the real example. Sounded quite generated to my ear as well. Wonder if it can do any better with more source material.

pclmulqdq 902 days ago

Looking at the website and the examples, it's pretty clearly set up to make stylized anime voices.

japanman185 902 days ago

This is the driver for a lot of things. Anime. x264 was to enable better compression of weeb videos. This tech will allow fan dubs to better represent the anime in the videos.

matheusmoreira 901 days ago

Anime also drove the development of a lot of subtitling technology if I remember correctly.

thorum 902 days ago

My experience with other tools like xtts is you really need to have a studio-quality voice sample to get the best results.
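For anyone wanting to rule out the obvious problems with a sample before blaming the model, here is a minimal sketch using only Python's standard `wave` module. It reports the sample rate, bit depth, channel count and duration of a PCM WAV file. The function names and the thresholds (`min_rate_hz`, `min_duration_s`) are illustrative assumptions, not values taken from xtts, OpenVoice or any other tool, and passing these checks says nothing about whether a recording is actually "studio-quality".

```python
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file (standard library only)."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        sample_width_bytes = wf.getsampwidth()
        sample_rate = wf.getframerate()
        frames = wf.getnframes()
    return {
        "channels": channels,
        "bit_depth": sample_width_bytes * 8,
        "sample_rate_hz": sample_rate,
        "duration_s": frames / sample_rate if sample_rate else 0.0,
    }


def quick_warnings(info, min_rate_hz=16000, min_duration_s=5.0):
    """Flag obviously weak samples. Thresholds are assumptions for illustration."""
    warnings = []
    if info["sample_rate_hz"] < min_rate_hz:
        warnings.append("low sample rate")
    if info["duration_s"] < min_duration_s:
        warnings.append("short clip")
    if info["bit_depth"] < 16:
        warnings.append("low bit depth")
    return warnings
```

Calling `quick_warnings(describe_wav("sample.wav"))` on your own recording (the file name is a placeholder) returns an empty list when none of the assumed thresholds are tripped. It does not measure noise, reverb or microphone quality, which are likely what matters most here.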

amluto 902 days ago

The most obvious problem to my ears is the syllable timing and inflection of the generated speech, and, intuitively, this doesn’t seem like a recording quality issue. It’s as if it did a mostly credible job of emulating the speaker trying to talk like a robot.

hwillis 901 days ago

The biggest trip-up is the pronunciation of "prototypically", and you had "typically" in your original. Maybe it's overfitting to a stilted proto-typically? Could try with a different, less similar sentence.

nxobject 901 days ago

That might be the next big contribution – performance in perceptually catching the features of a not-so-good recording – for example, with a webcam style microphone.