by PieSquared
2481 days ago
I'm an author on a few of the papers referenced (the Deep Voice papers from Baidu). I'm happy to answer any questions folks may have about neural speech synthesis, as I've been working on this for several years now.

In general, it's a fascinating space. There are challenges in text processing (not even mentioned in the blog), such as grapheme-to-phoneme conversion, part-of-speech detection, word sense disambiguation, and text normalization; challenges in utterance-level modeling (spectrograms); and challenges in "spectrogram inversion" / waveform synthesis.

The NLP components of the pipeline are often overlooked but are no less important than they were a few years ago: part of speech / word sense is the difference between "Time is a CONstruct" and "I'm going to conSTRUCT a tower", and is the difference between "Let's drop that bass" being about a DJ or about a fish.

The acoustic modeling phase (e.g. Tacotron, Deep Voice 3) works fairly well, and can produce some awesome demos with things like style tokens ("GST-Tacotron"), but still has a ways to go until it can encompass the full range of human inflection and emotion.

At the waveform synthesis level, models like WaveRNN (with subscale modeling) and Parallel WaveNet make it possible to deploy modern waveform synthesis models, but it's still a major issue to deploy them onto low-power devices due to compute restrictions.

Overall, there are lots of interesting challenges to work on, and we're making a lot of progress quite quickly. And I haven't even started talking about voice conversion or voice cloning!

If we're stuck with downsampling to 16 kHz, my question still stands.