by PieSquared
3390 days ago
Thank you for clarifying this! We tried fairly hard to make this clear, because as you say, the hard part is generating inflection and duration that sound natural. There's still a ton of work left to do in this direction; we're clearly nowhere near being able to generate human-level speech. Our work is meant to make working with TTS easier for deep learning researchers by describing a complete system that can be trained entirely from data, and to demonstrate that the neural vocoder substitutes can actually be deployed to streaming production servers. Future work (both by us and hopefully by other groups) will make further progress on inflection synthesis!