| HN Mirror



	by lukeboi 1302 days ago
	Relevant aside: What is the state of the art for real-time text to speech? Most recent papers and projects I've seen produce really high-quality audio but are too slow to synthesize speech in real time.

4 comments

nmfisher 1302 days ago

My app [0] currently uses a mildly customized version of FastSpeech 2 [1] with the LPCNet [2] vocoder, which I consider "good quality" at 16 kHz. It runs faster than real time on a mobile CPU (at least, on anything upwards of a mid-range 2017 device; I can stream practically instantly on my iPhone 11). Using a different vocoder on the mobile GPU could probably get even faster (which I don't want to do, for various reasons), and a desktop CPU is usually faster still.

There are various other flavours that can deliver faster synthesis (NixTTS comes to mind), but IMO they sacrifice quality even further.

"Good quality" is subjective, obviously. To me, it's perfectly audible, but there's definitely a noticeable difference in quality compared to the heavier diffusion-based models. It's much less crisp and loses some of the more subtle inflections, plosives, etc. For my purposes (language learning), it's fine for the time being but eventually it would be nice to move to a higher-end model.

[0] https://polyvox.app
[1] https://arxiv.org/abs/2006.04558
[2] https://github.com/xiph/LPCNet/


lionside 1302 days ago

I used to work at Resemble.ai and we used models that did real-time synthesis. I don’t think it’s particularly difficult anymore, even without sacrificing quality.


subbu 1301 days ago

Are these models available to the rest of us? On Hugging Face?


pmontra 1302 days ago

If this text were in an ebook, my phone could read it aloud in real time. I'm using Cool Reader and Samsung's voices. They sound like TTS, but it's OK.

I'm sure there are ways to select any text and make my phone read it in any app, but I don't need that and I didn't investigate. Actually I don't need it in ebooks either, but I know it's there and I checked that it works.


klysm 1302 days ago

What does real-time text to speech mean? Like latency from space bar to spoken audio?


recursive 1302 days ago

Not latency. Like it can synthesize at least as fast as it plays back. Meaning an hour of audio can be generated in an hour or less.
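
To make that definition concrete, here is a minimal sketch of the usual "real-time factor" calculation: synthesis time divided by the duration of the audio produced, where a value of 1.0 or less means the model keeps up with playback. The function names and the `synthesize` callable are hypothetical placeholders, not part of any TTS library mentioned in this thread.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """Return synthesis time divided by the duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def is_real_time(rtf):
    """A factor of 1.0 or less means synthesis is at least as fast as playback."""
    return rtf <= 1.0


def measure_rtf(synthesize, text, sample_rate):
    """Time a hypothetical synthesize(text) -> sequence of samples callable."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples), sample_rate)
```

For example, generating an hour of 16 kHz audio in 30 minutes, `real_time_factor(1800.0, 3600 * 16000, 16000)`, gives 0.5, which counts as real time under this definition.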


junon 1302 days ago

More importantly, can it synthesize as a stream.
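
A rough way to see why streaming matters: if audio is produced in chunks, playback can start as soon as the first chunk is ready, and it only stalls when a later chunk arrives after the queued audio has run out. The sketch below simulates that with made-up per-chunk timings; `simulate_stream` is a hypothetical helper for illustration, not an API from any TTS system.

```python
def simulate_stream(chunks):
    """Simulate chunked synthesis feeding a player.

    chunks: ordered list of (synthesis_seconds, audio_seconds) pairs.
    Returns (time_to_first_audio, total_stall_seconds).
    """
    clock = 0.0          # wall-clock time at which each chunk finishes synthesizing
    playback_end = 0.0   # wall-clock time at which the queued audio runs out
    first_audio = None
    stall = 0.0
    for synth_s, audio_s in chunks:
        clock += synth_s
        if first_audio is None:
            # Playback starts the moment the first chunk is ready.
            first_audio = clock
            playback_end = clock
        elif clock > playback_end:
            # The player ran dry before this chunk arrived.
            stall += clock - playback_end
            playback_end = clock
        playback_end += audio_s
    return first_audio, stall
```

With `[(0.25, 1.0), (0.5, 1.0), (2.0, 1.0)]`, the first audio is available after 0.25 seconds, and the slow third chunk causes a 0.5 second gap in playback.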
