| HN Mirror



	by jamez 1367 days ago
	I haven't tried Tortoise, thanks for pointing me to it. The voices were cloned by fine-tuning a VITS model with coqui.ai. I used about two hours of speech for each speaker. With more time and resources, I'm certain it's possible to make those voices considerably better.
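
For anyone who wants to check how much speech they have collected against the "about two hours per speaker" figure above, here is a minimal sketch. It assumes a hypothetical folder of uncompressed WAV clips; the `total_hours` helper is made up for illustration and is not part of Coqui's tooling.

```python
import wave
from pathlib import Path


def total_hours(clip_dir):
    """Sum the duration of every .wav file directly inside clip_dir, in hours."""
    seconds = 0.0
    for path in sorted(Path(clip_dir).glob("*.wav")):
        # The wave module reads the header, so no audio library is needed.
        with wave.open(str(path), "rb") as clip:
            seconds += clip.getnframes() / clip.getframerate()
    return seconds / 3600.0
```

Calling `total_hours("my_speaker/wavs")` (a placeholder path) returns a float; a value near 2.0 would roughly match the amount of speech described above.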


OgAstorga 1367 days ago

Can I get an invite link?


jamez 1367 days ago

No need to be invited. Between their GitHub[1] page and the documentation[2], you'll find everything you need to get started.

[1] https://github.com/coqui-ai/TTS
[2] https://tts.readthedocs.io/en/latest/


forgingahead 1367 days ago

How long did you train the models for each speaker, and what hardware were you using?


jamez 1366 days ago

It was fine-tuning, so the process was a lot faster than I originally anticipated. I'd say it was between 36 and 72 hours for each voice. I have been working in a Gradient notebook provided by Paperspace, which guaranteed me A6000 instances (48 GB of GPU RAM) at a reasonable flat rate. I discovered them after being repeatedly frustrated by the random allocation of GPUs on Colab's Pro+ plan.


hanselot 1367 days ago

How much input audio would you need to produce audiobook quality? Hint Hint...


blueberrychpstx 1367 days ago

https://coqui.ai?referralCode=q8jfhfs&refSource=copy help us move up the list!
