by robeastham 3569 days ago
I was very impressed with the TTS examples in the original DeepMind article (https://deepmind.com/blog/wavenet-generative-model-raw-audio...).

Can someone elaborate on the usefulness of this implementation for Text-to-Speech?

I'm keen to experiment with voice synthesis. I want to create dialog, from multiple voice sources, for some characters in a VR application that I'm working on.

Perhaps this lib is a better option for TTS:

https://github.com/ibab/tensorflow-wavenet

I guess I could do with an ELI5 on how I'd approach this with either of these libraries. I'm not familiar with any deep learning frameworks. But I am pretty handy with Python and have implemented SciKit stuff.

I'm also thinking this will give me a reason to try an Azure K80 instance instead of the AWS GPU instances I've been using for other stuff. That said, is a Tesla K80 the only option for WaveNet? I'm guessing I could run it on other GPUs, but I had read that memory might be an issue on some cards. If so, what's the lowest-end card I can run it on, and will one of the AWS GPU instances suffice? I also have a GTX 970 at home, but I'm guessing that won't cut it.

2 comments

As I understand it, WaveNet itself is quite resource-heavy to run; see this thread: https://news.ycombinator.com/item?id=12501204

I'm looking forward to seeing faster implementations in the future. Playing around with this looks like a lot of fun.

Short answer: Don't use this for practical purposes. It takes 90 minutes to generate 1 second of audio.
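
To put that figure in perspective, here is a back-of-the-envelope sketch in Python. It assumes generation time grows linearly with the length of the audio, which is an assumption for illustration and not something measured here; the function name is hypothetical.

```python
# Figure quoted above: roughly 90 minutes of compute per 1 second of audio.
MINUTES_PER_AUDIO_SECOND = 90.0


def generation_hours(audio_seconds, minutes_per_audio_second=MINUTES_PER_AUDIO_SECOND):
    """Estimate wall-clock hours, ASSUMING cost is linear in audio length."""
    return audio_seconds * minutes_per_audio_second / 60.0


if __name__ == "__main__":
    # A few short dialog lengths, in seconds of generated audio.
    for seconds in (1, 5, 30):
        print(f"{seconds:>3} s of audio -> ~{generation_hours(seconds):.1f} h")
```

Under that assumption, even a 30-second line of dialog would take on the order of days at this rate.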

Here's a good TTS system:

http://www.cstr.ed.ac.uk/projects/festival/

If you have phonetically rich source data, Festival will work pretty well. If you need a little more flexibility in your system, and can deal with a super weird training process, HTS (http://hts.sp.nitech.ac.jp/) is probably a better choice. With a small amount of work, you can use consolidated HTS models from within Festival.

Further, if you pine for the fjords of DNN-land, Merlin (https://github.com/CSTR-Edinburgh/merlin) is brand new and looking to make things a little easier for everybody.

But does anyone know if it's possible to do TTS with the recently released libraries?

Thanks for the links, but to my ear the samples on those links don't hit the mark. The WaveNet samples in the original article cross the threshold for me. So I'd like to try some short dialog tests, especially as I've read elsewhere that 1 second only takes 4 minutes on a K80.

Any light anyone else can shed on this would be great.

As far as I know, none of the released libraries support the TTS experiment described in the paper. DeepMind used pre-computed linguistic features to guide the system in generating natural-sounding speech, so your output will probably depend on the quality of those features. To avoid spreading misinformation: the 4 minutes was measured using a small model with a sampling rate of 4 kHz, which would not generate anything that sounds like the DeepMind samples.
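
A rough sketch of why the sampling rate matters for that timing: the model emits audio one sample at a time, so a higher rate means more samples per second of audio. The Python below only does the arithmetic. It assumes cost scales linearly with sample count and that the model size stays the same; both are assumptions, as is the 16 kHz target rate used for illustration. The function names are hypothetical.

```python
def seconds_per_sample(wall_minutes, audio_seconds, sample_rate_hz):
    """Average compute time spent on each generated audio sample."""
    total_samples = audio_seconds * sample_rate_hz
    return wall_minutes * 60.0 / total_samples


def scaled_minutes(wall_minutes, from_rate_hz, to_rate_hz):
    """Rescale a timing to another sampling rate.

    ASSUMES cost is linear in the number of samples and the model is unchanged.
    """
    return wall_minutes * to_rate_hz / from_rate_hz


if __name__ == "__main__":
    # Figure from the comment above: 4 minutes per second of audio at 4 kHz.
    per_sample = seconds_per_sample(4, 1, 4000)
    print(f"~{per_sample * 1000:.0f} ms per sample at 4 kHz")

    # Hypothetical target rate of 16 kHz, chosen only for illustration.
    print(f"~{scaled_minutes(4, 4000, 16000):.0f} min per audio second at 16 kHz")
```

Since the 4-minute figure came from a small model, a larger model would push the real number higher than this linear estimate.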

Thanks for the clarification and for spotting the 4 kHz error. This is fascinating stuff.

Looks like I'll have to concede that voice acting is much more practical, for now at least.