by narrationbox 1496 days ago
Your average mobile processor doesn't have anywhere near enough processing power to run a state-of-the-art text-to-speech network in real time. Most text-to-speech on mobile hardware is streamed from the cloud.
4 comments

I had a lot of success using FastSpeech2 + MB MelGAN via TensorFlowTTS: https://github.com/TensorSpeech/TensorFlowTTS. There are demos for iOS and Android which will allow you to run pretty convincing, modern TTS models with only a few hundred milliseconds of processing latency.
Dr. Sbaitso ran on a modest 386. Mobile device processors generally eclipse that and could definitely generate better quality TTS.
Not only is state of the art TTS much more demanding (and much much higher quality) than Dr. Sbaitso[0], but so are the not-quite-so-good TTS engines in both Android and iOS.

That said, having only skimmed the paper, I didn’t notice a discussion of the compute requirements for usage (just training), but it did say it was a 28.7 million parameter model, so I reckon this could be used in real time on a phone.

[0] judging by the videos of Dr. Sbaitso on YouTube, it was only one step up from the intro to Impossible Mission on the Commodore 64.

Ok, I get it, state of the art TTS uses AI techniques and so eats processing power, buuuuut seeing that much older efforts, which ran on devices like old PCs, the Amiga, the original Macintosh, the Kindle etc., used much less CPU for speech that you could (mostly) understand without problems, might it be worth exploring whether it's possible to write a better "dumb" (i.e. non-AI) speech synthesizer?
Better than the ones those systems already have? I assume they’ve already got some AI, because without AI, “minute” and “minute” get pronounced the same way because there’s no contextual clue to which instance is the unit of time and which is a fancy way of describing something as very small.
I'm still hoping that a human being can tell me which of the four possible ways to pronounce the name of the English post-punk band "The The" is the right one.

https://en.wikipedia.org/wiki/The_The

https://www.youtube.com/watch?v=orIy18qIaCU

I have a soft spot for the Yorkshire pronunciation: https://www.youtube.com/watch?v=lzymb0YJp7E&t=160s
The parent didn't mention real-time as a requirement. Offline rendering would well suffice.
28.7 million parameters is nothing for inference.
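To put a rough number on that, here is a back-of-the-envelope sketch of how much memory the weights alone would take. The 28.7 million figure is the one quoted upthread; the numeric precisions are assumptions for illustration (the thread does not say how the model is stored), and activations and runtime overhead are ignored.

```python
# Rough weight-storage estimate for a model of the size quoted in the thread.
PARAMS = 28_700_000  # parameter count mentioned upthread

# Assumed storage widths in bytes per parameter (illustrative, not from the paper).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def model_size_mb(params: int, bytes_per_param: int) -> float:
    """Size of the weights alone, in megabytes (1 MB = 10**6 bytes)."""
    return params * bytes_per_param / 1e6


for dtype, width in BYTES_PER_PARAM.items():
    print(f"{dtype}: {model_size_mb(PARAMS, width):.1f} MB")
```

Even at full 32-bit precision that comes to roughly 115 MB of weights. Memory is only part of the picture, though: whether it runs in real time depends on the operations per generated audio frame, which this estimate does not cover.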
Often you can prune parameters as well. You might be able to cut that down by a factor of 10 without any noticeable loss in accuracy.
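For anyone wondering what pruning looks like mechanically, a minimal sketch of one common approach, magnitude pruning: keep the largest weights by absolute value and zero out the rest. The function name `magnitude_prune` and the random toy weights are made up for illustration; this is not taken from the paper or from any particular library, and it says nothing about whether accuracy actually survives a 10x cut on a given model.

```python
import random


def magnitude_prune(weights, keep_fraction):
    """Zero out all but the largest-magnitude weights (hypothetical helper)."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    keep = max(1, round(len(weights) * keep_fraction))
    # Indices of the `keep` weights with the largest absolute value.
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep_set = set(ranked[:keep])
    return [w if i in keep_set else 0.0 for i, w in enumerate(weights)]


# Toy stand-in for one layer's weights; a real model would have millions.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

pruned = magnitude_prune(weights, keep_fraction=0.1)  # the "factor of 10"
nonzero = sum(1 for w in pruned if w != 0.0)
print(f"{nonzero} of {len(pruned)} weights kept")
```

Two caveats: in practice the pruned model is usually fine-tuned afterwards to recover quality, and zeroed weights only save memory or compute if the runtime stores and multiplies them sparsely.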