by kastnerkyle 4145 days ago
What techniques are being used for text to speech? Is it something deep learning related or more standard HMM synthesis? Any paper references?

According to the documentation[1], it's a concatenative synthesizer using decision trees for prosody modeling and PSOLA for output.

[1]: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercl...

Thanks! I am working in this area and have some ideas for deep learning type methods which move away from concatenative synthesis. It will be nice to compare to what they are using.
We did some work on applying NNs to prosody prediction; see Fernandez, Raul, et al. "Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks." Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH). 2014.
This paper (from ICASSP 2013) may be of interest to you: https://static.googleusercontent.com/media/research.google.c...