

by sdenton4 2227 days ago

> two approaches

Hmmmm.... In my experience, for (neural) TTS the dominant option is to have one model generate a melspectrogram from the text (handling prosody) and a second model synthesize samples from the melspectrogram. (Tacotron, Lyrebird, and this Facebook group are all doing this.) There are certainly research projects on going directly from text to samples, but it's not the currently winning strategy... Maybe eventually, though. The Text->Mel portion specializes in the prosody and pronunciation problem, and provides a nice place to add extra conditioning.

On the vocoder side: LPC, F0, etc. can all be estimated from a reasonably sized melspectrogram; for the most part, these neural models just let the big vocoder model handle all of these things, which were traditionally separate (and fragile!) subtasks. The question is which "classical" parts are both cheap and reliable: those you can compute on the side to lighten the neural network's burden. LPC is great for this.
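To make the "LPC is cheap" point concrete, here's a rough sketch (not taken from any of the systems mentioned) of getting LPC coefficients from a single frame's power spectrum: inverse-FFT the power spectrum to get the autocorrelation, then run Levinson-Durbin. The function names and the `order` parameter are my own, and it assumes you already have a linear-frequency power spectrum; mapping a melspectrogram back to linear frequency (e.g. by interpolation) is lossy and left out here.

```python
import numpy as np


def levinson_durbin(r, order):
    """Solve for LPC coefficients a[0..order] (with a[0] = 1) from autocorrelation r.

    Convention: the prediction-error filter is A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.
    Returns (a, err), where err is the final prediction-error power.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Correlation of the current predictor with the next lag.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def lpc_from_power_spectrum(power, order):
    """Estimate LPC from a one-sided, linear-frequency power spectrum.

    `power` has n_fft // 2 + 1 bins (n_fft even); by the Wiener-Khinchin relation
    its inverse FFT is the (circular) autocorrelation of the frame.
    """
    n_fft = 2 * (len(power) - 1)
    r = np.fft.irfft(power, n=n_fft)[: order + 1]
    return levinson_durbin(r, order)


# Usage: a synthetic spectrum with a single pole at 0.9 (assumed toy input).
n_fft = 1024
w = 2 * np.pi * np.arange(n_fft // 2 + 1) / n_fft
toy_power = 1.0 / np.abs(1.0 - 0.9 * np.exp(-1j * w)) ** 2
coeffs, residual_power = lpc_from_power_spectrum(toy_power, order=2)
```

That's a handful of multiply-adds per frame, which is why it's attractive to compute on the side and only ask the network to model the residual.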