| HN Mirror


	by fab1an 1367 days ago
	Amazing work! Curious how you cloned the voices. Did you use Tortoise? I've previously tried Herzog, but couldn't quite train the German accent...

2 comments

jamez 1367 days ago

I haven't tried Tortoise, thanks for pointing me to it. The voices were cloned by fine-tuning a VITS model with coqui.ai. I used about two hours of speech for each speaker. With more time and resources, I'm certain it's possible to make those voices considerably better.
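
For anyone wanting to check whether their own dataset is near that two-hour mark, here is a minimal sketch that totals the duration of the clips in a folder. It uses only the Python standard library and assumes uncompressed PCM `.wav` files; the function name `total_hours` and the example path are made up for illustration and are not part of coqui's tooling.

```python
import wave
from pathlib import Path


def total_hours(dataset_dir: str) -> float:
    """Sum the duration of every .wav file under dataset_dir, in hours."""
    seconds = 0.0
    for path in Path(dataset_dir).rglob("*.wav"):
        # The wave module reads the header of uncompressed PCM files only.
        with wave.open(str(path), "rb") as wav:
            seconds += wav.getnframes() / wav.getframerate()
    return seconds / 3600.0


# Example call (hypothetical path):
# print(f"{total_hours('my_speaker/wavs'):.2f} hours of audio")
```

Compressed formats such as MP3 or FLAC would need converting to WAV first, or a different reader.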

OgAstorga 1367 days ago

Can I get an invite link?

jamez 1367 days ago

No need to be invited. Between their GitHub[1] page and the documentation[2], you'll find everything you need to get started.

[1] https://github.com/coqui-ai/TTS [2] https://tts.readthedocs.io/en/latest/

forgingahead 1367 days ago

How long did you train the models for each speaker, and what hardware were you using?

jamez 1366 days ago

It was fine-tuning, so the process was a lot faster than I originally anticipated. I'd say it was between 36 and 72 hours for each voice. I have been working on a Gradient notebook provided by Paperspace, which guaranteed me A6000 instances (48GB GPU RAM) at a reasonable flat rate. I discovered them after being repeatedly frustrated by the random allocation of GPUs on Colab's Pro+ plan.

hanselot 1367 days ago

How much input audio would you need to produce audiobook quality? Hint Hint...

blueberrychpstx 1367 days ago

https://coqui.ai?referralCode=q8jfhfs&refSource=copy help us move up the list!

biztos 1367 days ago

I don’t know if this is useful, but Herzog has a distinctly Bavarian accent. And of course he has spent most of his adult life far from there, so it’s not quite Bavarian either.

Training a Herzogbot on recordings/transcriptions of, say, Kinski would be a waste of time accent-wise.
