
by nmfisher 1262 days ago

My app [0] currently uses a mildly customized version of FastSpeech 2 [1] with the LPCNet [2] vocoder, which I consider "good quality" at 16 kHz. It runs faster than realtime on a mobile CPU (at least on anything from a mid-range 2017 device upwards; I can stream practically instantly on my iPhone 11). Using a different vocoder on a mobile GPU could probably be even faster (which I don't want to do, for various reasons), and desktop CPUs are usually faster still.
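For anyone unsure what "faster than realtime" means here: it is usually expressed as a real-time factor (RTF), the time spent synthesizing divided by the duration of the audio produced, with anything below 1.0 being faster than realtime. A minimal sketch of that arithmetic follows; the function name and the numbers are made up for illustration and are not measurements from my app.

```python
def real_time_factor(synthesis_seconds: float,
                     num_samples: int,
                     sample_rate_hz: int = 16_000) -> float:
    """Return synthesis time divided by the duration of the generated audio.

    Hypothetical helper for illustration; values below 1.0 mean the
    synthesizer produces audio faster than it takes to play back.
    """
    if num_samples <= 0 or sample_rate_hz <= 0:
        raise ValueError("num_samples and sample_rate_hz must be positive")
    audio_seconds = num_samples / sample_rate_hz
    return synthesis_seconds / audio_seconds


# Illustrative numbers only: 32,000 samples at 16 kHz is 2 s of audio,
# and we pretend it took 0.5 s to synthesize.
rtf = real_time_factor(synthesis_seconds=0.5, num_samples=32_000)
print(f"RTF = {rtf:.2f}")  # RTF = 0.25
```

An RTF comfortably below 1.0 is what makes streaming feel instant, since each chunk of audio is ready before the previous one finishes playing.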

There are various other flavours that can deliver faster synthesis (NixTTS comes to mind), but IMO they sacrifice quality even further.

"Good quality" is subjective, obviously. To me, it's perfectly intelligible, but there's definitely a noticeable difference in quality compared to the heavier diffusion-based models. It's much less crisp and loses some of the more subtle inflections, plosives, etc. For my purposes (language learning), it's fine for the time being, but eventually it would be nice to move to a higher-end model.

[0] https://polyvox.app

[1] https://arxiv.org/abs/2006.04558

[2] https://github.com/xiph/LPCNet/