| HN Mirror



	by narrationbox 1496 days ago
	Your average mobile processor doesn't have anywhere near enough processing power to run a state-of-the-art text-to-speech network in real time. Most text-to-speech on mobile hardware is streamed from the cloud.

4 comments

Arbortheus 1496 days ago

I had a lot of success using FastSpeech2 + MB MelGAN via TensorFlowTTS: https://github.com/TensorSpeech/TensorFlowTTS. There are demos for iOS and Android which will allow you to run pretty convincing, modern TTS models with only a few hundred milliseconds of processing latency.


kevin_thibedeau 1496 days ago

Dr. Sbaitso ran on a modest 386. Mobile device processors generally eclipse that and could definitely generate better quality TTS.


ben_w 1496 days ago

Not only is state of the art TTS much more demanding (and much much higher quality) than Dr. Sbaitso[0], but so are the not-quite-so-good TTS engines in both Android and iOS.

That said, having only skimmed the paper, I didn’t notice a discussion of the compute requirements for inference (just training), but it did say it was a 28.7 million parameter model, so I reckon this could be used in real time on a phone.

[0] judging by the videos of Dr. Sbaitso on YouTube, it was only one step up from the intro to Impossible Mission on the Commodore 64.


rob74 1496 days ago

Ok, I get it, state of the art TTS uses AI techniques and so eats processing power, buuuuut seeing that much older efforts which ran on devices like old PCs, the Amiga, the original Macintosh, the Kindle etc. used much less CPU for speech that you could (mostly) understand without problems, it may be worth exploring if it's possible to write a better "dumb" (i.e. non-AI) speech synthesizer?


ben_w 1496 days ago

Better than the ones those systems already have? I assume they’ve already got some AI, because without AI, “minute” and “minute” get pronounced the same way because there’s no contextual clue to which instance is the unit of time and which is a fancy way of describing something as very small.
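To make the "minute" problem concrete, here is a toy sketch of what a non-AI engine would have to do: pick a reading from the neighbouring words using hand-written rules. The word lists, the function name `readings_of_minute` and the respellings "MIN-it" / "my-NOOT" are all made up for illustration, not taken from any real TTS engine, and the rules are easy to break. How far rules like this get compared with a learned model is not something the sketch settles.

```python
import re

# Hypothetical cue lists, chosen only for this illustration.
COUNT_WORDS = {"one", "two", "three", "five", "ten", "per", "every", "each"}
INTENSIFIERS = {"very", "extremely", "so", "quite"}
TIME_FOLLOWERS = {"ago", "later", "to", "of", "hand", "and", "or"}
DETERMINERS = {"a", "the", "this", "that", "in"}

def readings_of_minute(sentence):
    """Return one guessed reading per occurrence of 'minute' in the sentence."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    readings = []
    for i, tok in enumerate(tokens):
        if tok != "minute":
            continue
        prev = tokens[i - 1] if i > 0 else None
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if prev is not None and (prev.isdigit() or prev in COUNT_WORDS):
            readings.append("MIN-it")    # "five minute walk"
        elif prev in INTENSIFIERS:
            readings.append("my-NOOT")   # "very minute"
        elif nxt is None or nxt in TIME_FOLLOWERS:
            readings.append("MIN-it")    # "wait a minute", "a minute ago"
        elif prev in DETERMINERS:
            readings.append("my-NOOT")   # "this minute crack"
        else:
            readings.append("MIN-it")    # fall back to the unit of time
    return readings

print(readings_of_minute("Give me a minute to inspect this minute crack."))
```

Without the context checks, both occurrences would fall through to the same default reading, which is the failure described above.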


DonHopkins 1495 days ago

I'm still hoping that a human being can tell me which of the four possible ways to pronounce the name of the English post-punk band "The The" is the right one.

https://en.wikipedia.org/wiki/The_The

https://www.youtube.com/watch?v=orIy18qIaCU


ben_w 1495 days ago

I have a soft spot for the Yorkshire pronunciation: https://www.youtube.com/watch?v=lzymb0YJp7E&t=160s


ccbccccbbcccbb 1496 days ago

The parent didn't mention real-time as a requirement. Offline rendering would well suffice.


SemanticStrengh 1496 days ago

28.7 million parameters is nothing for inference.
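A rough back-of-the-envelope check on the size of the weights alone, using the 28.7 million figure quoted upthread. The bytes-per-parameter values are assumptions about how the weights might be stored, and activations, runtime overhead and compute per audio sample are not counted, so this says nothing about latency.

```python
PARAMS = 28_700_000  # parameter count quoted upthread

# Assumed storage width per parameter, in bytes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_megabytes(params, bytes_per_param):
    """Size of the weights alone, in megabytes (10**6 bytes)."""
    return params * bytes_per_param / 1e6

for dtype, width in BYTES_PER_PARAM.items():
    print(f"{dtype}: {weight_megabytes(PARAMS, width):.1f} MB")
# float32: 114.8 MB
# float16: 57.4 MB
# int8: 28.7 MB
```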


snek_case 1496 days ago

Often you can prune parameters as well. You might be able to cut that down by a factor of 10 without any noticeable loss in accuracy.
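For anyone unfamiliar with the idea, a minimal sketch of magnitude pruning on a toy weight list: keep the largest 10% of weights by absolute value and zero the rest. The function name `magnitude_prune` and the random "layer" are made up for illustration. It only shows the mechanism; whether accuracy survives a 10x cut depends on the model and usually needs fine-tuning to find out.

```python
import random

def magnitude_prune(weights, keep_fraction):
    """Zero out all but the largest-magnitude `keep_fraction` of weights."""
    keep = max(1, round(len(weights) * keep_fraction))
    # Indices of the `keep` largest weights by absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:keep])
    return [w if i in kept else 0.0 for i, w in enumerate(weights)]

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # toy stand-in for one layer
pruned = magnitude_prune(weights, keep_fraction=0.1)

nonzero = sum(1 for w in pruned if w != 0.0)
print(nonzero, "of", len(pruned), "weights kept")
```

Note that zeroed weights only save memory or time if the storage format or runtime actually exploits the sparsity.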
