| This post on Deep Voice seems a little off-the-mark. In fact, I would say it is completely misleading about the technical accomplishments here. From my perspective, Baidu's approach is a little embarrassing, with the use of many modeling stages in their training and production of TTS. When the rest of the community is moving towards end-to-end training, their usage of this many stages sounds excruciating. Merlin[0], which was a pretty good standard for 2016, has this painful feeling as well, with two DL stages (duration, acoustic) followed by some conditioning and then a synthesis step. The more important technical contribution seems to be the hand-tuned synthesis code that makes their generation faster; cool but not particularly sexy (and there are few details). The details on training hyperparams are nice to have too, of course. Contrary to the post, I would be very surprised if the voice sample included in the post was actually generated by Deep Voice -- it has none of the robotic qualities pointed out by the researchers themselves in their blog post[1]. More likely it is a demonstration of the loss in their last, WaveNet-like stage. This was also pointed out in the previous HN discussion[2]. Lastly, Andrew Ng is neither thanked in the paper nor mentioned on any webpage -- are we sure this was work he supervised? [0] https://github.com/CSTR-Edinburgh/merlin [1] http://research.baidu.com/deep-voice-production-quality-text... [2] https://news.ycombinator.com/item?id=13756489 |
- I state the caveat that the voice sample published for Deep Voice uses ground-truth features. That said, I can make it clearer.
- Andrew Ng runs the Baidu AI team. He may not have supervised this work, but he's associated with it.
- I've gotten direct feedback from one of the paper's original authors to ensure the post represents its accomplishments well. At this point I believe it does, save for that caveat.