by lukeboi 1255 days ago
Relevant aside: What is state-of-the-art for real time text to speech?

Most recent papers & projects I've seen are really high quality but are too slow to synthesize speech in real time.

4 comments

My app [0] currently uses a mildly customized version of FastSpeech 2 [1] with LPCNet [2] vocoder, which I consider "good quality" @ 16kHz. Faster than realtime on mobile CPU (at least, on anything upwards of a mid-range 2017 device - I can stream practically instantly on my iPhone 11). Using a different vocoder with mobile GPU could probably get even faster (which I don't want to do, for various reasons), and desktop CPU is usually even faster.

There are various other flavours that can deliver faster synthesis (NixTTS comes to mind), but IMO they sacrifice quality even further.

"Good quality" is subjective, obviously. To me, it's perfectly audible, but there's definitely a noticeable difference in quality compared to the heavier diffusion-based models. It's much less crisp and loses some of the more subtle inflections, plosives, etc. For my purposes (language learning), it's fine for the time being but eventually it would be nice to move to a higher-end model.

[0] https://polyvox.app [1] https://arxiv.org/abs/2006.04558 [2] https://github.com/xiph/LPCNet/

I used to work at Resemble.ai and we used models that did real-time synthesis. I don’t think it’s particularly difficult anymore, even without sacrificing quality.
Are these models available to the rest of us? On Hugging Face?
If this text was in an ebook my phone could read it aloud in real time. I'm using Cool Reader and Samsung's voices. They feel like TTS but it's OK.

I'm sure there are ways to select any text and make my phone read it in any app, but I don't need that and I didn't investigate. Actually I don't need it in ebooks either, but I know it's there and I checked that it works.

What does real-time text to speech mean? Like latency from space bar to spoken?
Not latency. Like it can synthesize at least as fast as it plays back. Meaning an hour of audio can be generated in an hour or less.
More importantly, can it synthesize as a stream.
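
To make the two definitions above concrete, here is a rough sketch in Python. It is not taken from any TTS library; the function names and the per-chunk timing lists are made up for illustration. The first function computes the usual real-time factor (synthesis time divided by audio duration, so 1.0 or less means "an hour of audio in an hour or less"). The second checks the streaming case, assuming playback starts as soon as the first chunk is ready and chunks are synthesized one after another.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Time spent synthesizing divided by duration of audio produced.

    A value <= 1.0 means synthesis keeps up with playback overall.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def streams_without_gaps(chunk_synth_seconds, chunk_audio_seconds) -> bool:
    """Return True if streamed playback never runs out of audio.

    Assumptions (hypothetical model, not any real engine's behaviour):
    chunks are synthesized sequentially, and playback begins the moment
    the first chunk is ready.
    """
    ready_at = 0.0       # wall-clock time when the current chunk is done
    playback_end = None  # wall-clock time when queued audio runs out
    for synth, audio in zip(chunk_synth_seconds, chunk_audio_seconds):
        ready_at += synth
        if playback_end is None:
            # First chunk: playback starts as soon as it is ready.
            playback_end = ready_at + audio
        elif ready_at > playback_end:
            # The queue drained before this chunk arrived: audible gap.
            return False
        else:
            playback_end += audio
    return True


if __name__ == "__main__":
    # One hour of audio generated in half an hour.
    print(real_time_factor(1800, 3600))
    # Overall faster than real time, but the slow second chunk causes a stall.
    synth = [0.1, 1.0, 0.1]
    audio = [0.5, 0.5, 2.0]
    print(real_time_factor(sum(synth), sum(audio)))
    print(streams_without_gaps(synth, audio))
```

The second example is the point of the streaming reply: a model can be faster than real time on average and still stutter if a single chunk takes longer to synthesize than the audio already buffered.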