Hacker News
by whodunser 3392 days ago
This post on Deep Voice seems a little off the mark. In fact, I would say it is completely misleading about the technical accomplishments here.

From my perspective, Baidu's approach is a little embarrassing: it uses many separate modeling stages for both training and production TTS. When the rest of the community is moving towards end-to-end training, their use of this many stages sounds excruciating. Merlin[0], which was a pretty good standard for 2016, has this painful feeling as well, with two DL stages (duration, acoustic) followed by some conditioning and then a synthesis step.

The more important technical contribution seems to be the hand-tuned synthesis code that makes their generation faster; cool but not particularly sexy (and there are few details). The details on training hyperparams are nice to have too, of course.

Contrary to the post, I would be very surprised if the voice sample included in the post was actually generated by Deep Voice -- it has none of the robotic qualities pointed out by the researchers themselves in their blog post[1]. More likely it is a demonstration of the loss in their last, WaveNet-like stage. This was also pointed out in the previous HN discussion[2].

Lastly, Andrew Ng is neither thanked in the paper nor mentioned on any webpage -- are we sure this was work he supervised?

[0] https://github.com/CSTR-Edinburgh/merlin

[1] http://research.baidu.com/deep-voice-production-quality-text...

[2] https://news.ycombinator.com/item?id=13756489

1 comment

Thanks for your feedback.

- I state the caveat that the voice sample published for Deep Voice uses ground-truth features. That being said, I can make it clearer.

- Andrew Ng runs the Baidu AI team. He may not have supervised this work, but he's associated with it.

- I've gotten direct feedback from an original author of this paper to ensure the post represents its accomplishments well. At this point I believe it does, save for the caveat above.