| I'm the author of FakeYou.com and can speak to Tortoise and the TTS field. Tortoise produces quality results with limited training data, but it is an extremely slow model that is not suitable for real-time use cases. You can't build an app with it. It's good for creatives making one-off deepfake YouTube videos, and that's about it. You're looking for Tacotron 2 or one of its offshoots that add multi-speaker support, TorchMoji, etc. You'll want to pair it with the HiFi-GAN vocoder to get end-to-end text-to-speech. (Avoid Griffin-Lim and WaveGlow.) Your pipeline looks like this at a high level: Input text => Text pre-processing => Synthesizer => Vocoder => [ Optional transcoding ] => Output audio
TalkNet is also popular when a secondary reference pitch signal is supplied. You can mimic singing and emotion pretty easily. These three models are faster than real time, and there's a lot of information available and a big community built up around them. FakeYou's Discord has a bunch of people who can show you how to train these models, and there are other Discord communities that offer the same assistance. If you want to train your own voice using your own collected sample data, you can experiment with it on Google Colab and on FakeYou, then reuse the same model file by hosting it on a cloud GPU instance. We can also do the hosting for you if that's not your desire or forte. In any case, these models are solid choices for building consumer apps. As long as you have a GPU, you're good to go. If you're not interested in building or maintaining your own, you can use our API! I'd be happy to help. |
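For anyone who wants to see the shape of the pipeline from the comment above (Input text => Text pre-processing => Synthesizer => Vocoder => Output audio), here is a rough sketch. It is not FakeYou's code and it loads no real models: `stub_synthesizer` and `stub_vocoder` are hypothetical placeholders standing in for something like Tacotron 2 and HiFi-GAN, and the 22050 Hz sample rate and 256-sample hop are assumed values, not taken from the comment. The optional transcoding step is left out.

```python
import io
import math
import re
import struct
import wave
from typing import Callable, List

SAMPLE_RATE = 22050  # assumption: a common rate for these models
HOP = 256            # assumption: audio samples produced per frame

Frames = List[List[float]]


def preprocess(text: str) -> str:
    """Text pre-processing: lowercase, drop unsupported characters, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' .,?!-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def stub_synthesizer(text: str) -> Frames:
    """Hypothetical stand-in for the synthesizer: one fake 1-bin frame per character."""
    return [[(ord(ch) % 32) / 32.0] for ch in text]


def stub_vocoder(frames: Frames) -> List[float]:
    """Hypothetical stand-in for the vocoder: renders each frame as a short sine tone."""
    samples: List[float] = []
    for frame in frames:
        freq = 110.0 + 440.0 * frame[0]
        for n in range(HOP):
            samples.append(0.3 * math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
    return samples


def to_wav_bytes(samples: List[float]) -> bytes:
    """Output audio: pack float samples as 16-bit mono PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(SAMPLE_RATE)
        pcm = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
        )
        w.writeframes(pcm)
    return buf.getvalue()


def text_to_speech(
    text: str,
    synthesizer: Callable[[str], Frames] = stub_synthesizer,
    vocoder: Callable[[Frames], List[float]] = stub_vocoder,
) -> bytes:
    """Run the stages in order: pre-process, synthesize, vocode, encode."""
    clean = preprocess(text)
    frames = synthesizer(clean)
    samples = vocoder(frames)
    return to_wav_bytes(samples)


if __name__ == "__main__":
    audio = text_to_speech("Hello,   World!")
    print(len(audio), "bytes of WAV audio")
```

The synthesizer and vocoder are passed in as plain callables so that real model wrappers could be swapped in for the stubs without touching the rest of the pipeline.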
What would you run if you had a large set of training data (and time and money) but your focus was on quality? Still Tortoise?