@yacineMTB (Twitter) used Tortoise to diy his own podcast replicating Joe Rogan (by ChatGPT) & the results are amazing worth a quick listen to get the gist [1]
I wrote a script that
- pulled @_akhaliq's last 7 days of tweets
- fished out the arxiv links
- downloaded raw paper .tex
- parsed out intros & conclusions
- automated a podcast dialogue about the papers w/ web automation & GPT
- generated a podcast
As someone that works in generative modeling (vision) there's something that sparks suspicion. They note that these are hand picked results. Has anyone used this and can report the actual quality of results? I bring this up because anyone that has used Stable Diffusion or DallE will know why. Hand picked results are good, but median results matter a lot too.
I'm the author of FakeYou.com and can speak to Tortoise and the TTS field.
Tortoise produces quality results with limited training data, but is an extremely slow model that is not suitable for real time use cases. You can't build an app with it. It's good for creatives making one-off deepfake YouTube videos, and that's about it.
You're looking for Tacotron 2 or one of its offshoots that add multi-speaker, TorchMoji, etc. You'll want to pair it with the Hifi-Gan vocoder to get end-to-end text to speech. (Avoid Griffin-Lim and WaveGlow.)
Your pipeline looks like this at a high level:
Input text => Text pre-processing => Synthesizer => Vocoder => [ Optional transcoding ] => Output audio
TalkNet is also popular when a secondary reference pitch signal is supplied. You can mimic singing and emotion pretty easily.
These three models are faster than real time, and there's a lot of information available and a big community built up around them. FakeYou's Discord has a bunch of people that can show you how to train these models, and there are other Discord communities that offer the same assistance.
If you want to train your own voice using your own collected sample data, you can experiment with it on Google Colab and on FakeYou, then reuse the same model file by hosting it in a cloud GPU instance. We can also do the hosting for you if that's not your desire or forte.
In any case, these models are solid choices for building consumer apps. As long as you have a GPU, you're good to go. If you're not interested in building or maintaining your own, you can use our API! I'd be happy to help.
Thanks for this, I actually appreciate the honesty. It is always difficult for me to parse the actual quality of things I don't have intimate experience with.
Can I ask another question? If I wanted to hack around with STT and TTS (inference only) on a pi (4B+) is there anything that is approximately appropriate and can be done on device? (I could process on my main machine but I'd love to do it on the pi even with a decent delay)
They provide support for running in a Raspberry Pi and it runs in real-time. I have tried the desktop version and the quality is good enough when the audio is clean.
There are other ML TTS models that are both lightweight and can run on a CPU. Check out Glow-TTS for something that will probably work.
Also swap out the HifiGan vocoder for Melgan or MB-Melgan as these will also better support your use case.
I ran this exact setup on cheap Digital Ocean droplets (without GPUs) and it ran faster than real time. It should work on a Pi.
Unfortunately I'm not aware of STT models that operate under these same hardware constraints, but you should be good to go for TTS. With a little bit of poking around, I'm sure you can find a solution for STT too.
> Tortoise is a bit tongue in cheek: this model is insanely slow. It leverages both an autoregressive decoder and a diffusion decoder; both known for their low sampling rates. On a NVidia Tesla K80, expect to generate a medium sized sentence every 2 minutes.
I suspect that for a real(-ish) time TTS system, something else is needed. OTOH if you want to record some voice acting for a game or other multimedia product, it still may be more cost-effective than recording a bunch of live humans.
(K8 = NVidia Tesla K80, GPU, $800-900 for a 24GB version right now.)
Would it still require a 3080 to run adequately, that is, with 1-2 seconds of delay? I've no idea what consumer-grade hardware works well for ML loads.
Kepler, Maxwell, Turing, Volta, ampere, Lovelace, hopper. it's 6 generations old when you include the micro architectures. it would be about a 10x improvement.
NLP is going to have this problem for a long time. Obviously most original research is done by Americans in English. There are really only valid training sets for languages that NLP researchers or engineers speak.
It also works in German and I'm relatively certain it's not translated outside the model itself. I've asked it to generate puns incorporating certain words and while the English results were subjectively somewhat better, the German ones were still "fine" and definitely wouldn't work in English.
ChatGPT is so crazy it even works in fluent Thai. That's better than any machine translation I've ever tried so far. It even takes cultural differences into account. For example when you ask it to translate "I love you" into Thai, it mentions, that normally you would not say this in the same circumstances as you would say it to your lover in the West, correctly explaining in what circumstances people would really use it, and what to use instead. That's revolutionary for minority languages without a lot of learning material available online.
Also I am a native Swiss German speaker. For those who don't know: Swiss German is a dialect continuum, very very different from standard German to an extend, that most untrained German speakers don't understand us. There is no orthography (writing rules), no grammar rules etc. It's a mostly undocumented/unofficial writing system. Only spoken, and the varieties are vast. And guess what, I can write in completely random, informal Swiss German dialect and ChatGPT understands everything, but answers in standard German.