That is some very nice and interesting work!
In fact, I have also worked on exactly the same thing, so I'm impressed by your accomplishments.
How much have you played around with different local condition features, i.e. the phoneme signal? Was it always with 256 Hz? Have you always used nearest-neighbor for upsampling to 16 kHz? Have you always used those 2 + (1 + 2 + 2) * (40 + 5) = 227 dimensions?
We tried just with 39 dimensional phonemes, which also worked but the quality was not so nice and it sounded very robotic, probably due to missing F0. We also only had 100 Hz, but we tried some variants to upscale it to 16 kHz, like linear interpolation or deconv or combinations of them.
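For readers following along, here is a minimal sketch of two of the upsampling variants mentioned above (nearest-neighbor and linear interpolation), going from 100 Hz frames to 16 kHz samples. The function names and the F0 values are made up for illustration; this is not code from either project.

```python
def upsample_nearest(frames, factor):
    """Repeat each frame value `factor` times (nearest-neighbor / sample-and-hold)."""
    return [value for value in frames for _ in range(factor)]


def upsample_linear(frames, factor):
    """Linearly interpolate between consecutive frames; the last frame is held."""
    out = []
    for i, value in enumerate(frames):
        nxt = frames[i + 1] if i + 1 < len(frames) else value
        for k in range(factor):
            out.append(value + (nxt - value) * k / factor)
    return out


FRAME_RATE_HZ = 100      # conditioning rate mentioned above
AUDIO_RATE_HZ = 16_000   # target sample rate
factor = AUDIO_RATE_HZ // FRAME_RATE_HZ  # 160 audio samples per frame

f0_frames = [100.0, 120.0, 110.0]  # made-up F0 values in Hz, one per frame
nearest = upsample_nearest(f0_frames, factor)
linear = upsample_linear(f0_frames, factor)
```

Nearest-neighbor produces a staircase signal, while linear interpolation ramps between frames; both yield 160 samples per input frame.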
In the local conditioning network, you used QRNNs. Did you also try simpler methods, like just pure convolution? (And then the upsampling like you did, by nearest neighbor.)
You are predicting phone duration + F0. Have you also tried an encoder-decoder approach instead, like in Char2Wav? I.e. instead of the duration prediction, you let the decoder unroll it. Then, also like Char2Wav, you can also combine that directly with your Grapheme-to-Phoneme model. Have you tried that?
Did you also try some global condition, like speaker identity?
We also tried all the sampling methods you are listing and observed the same behavior, i.e. only the direct sampling really works. I tried many more deterministic variants (like taking mean) but none of them worked. This is a bit strange. Also the quality can vary depending on the random seed.
Feel free to get in touch for more Q/A, my email is in my profile.
We've experimented a bunch with many of these hyperparameters. Our phoneme signal has mostly stayed 256 Hz, but we've done a few experiments with lower-frequency signals that indicate it's probably possible to reduce it.
We have used many types of upsampling, and find that the upsampling and conditioning procedure does not affect the quality of the audio itself, but does affect the frequency of pronunciation mistakes. We used upsampling based on bicubic and bilinear interpolation, as well as transposed convolutions and a variety of other simpler convolutions (for example, per-channel transposed convolutions). These tend to work and converge, but then generate pronunciation mistakes on difficult phonemes. A full transposed convolution upsampling (two transposed convolution layers with stride 8 each) works almost as well as our bidirectional QRNNs, but it's much, much more expensive in terms of compute and parameters, and takes longer to train as well.
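To make the "two transposed convolution layers with stride 8 each" concrete, here is a rough single-channel sketch showing how two stride-8 stages give a 64x upsampling. The helper name and the all-ones kernel are assumptions for illustration; a real layer would be multi-channel with learned weights.

```python
def transposed_conv1d(signal, kernel, stride):
    """Single-channel 1-D transposed convolution.

    Each input sample scatters a scaled copy of `kernel` into the output,
    starting at position i * stride. The output length is
    (len(signal) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, x in enumerate(signal):
        for k, w in enumerate(kernel):
            out[i * stride + k] += x * w
    return out


STRIDE = 8
# Assumption: an all-ones kernel as wide as the stride, which makes the layer
# behave like nearest-neighbor upsampling. A trained layer would learn this.
box_kernel = [1.0] * STRIDE

frames = [0.5, -1.0, 2.0]                                # three conditioning frames
stage1 = transposed_conv1d(frames, box_kernel, STRIDE)   # 3 -> 24 samples
stage2 = transposed_conv1d(stage1, box_kernel, STRIDE)   # 24 -> 192 samples
```

With this factor of 64, a 256 Hz signal would come out at 16,384 Hz; hitting exactly 16,000 Hz would need a non-integer factor of 62.5, so treat the factor here as an assumption.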
As noted in the paper, we used many of the original features used for WaveNet before reducing our feature set. F0 is definitely important for proper intonation. We find that including the surrounding phonemes is quite important; with the bidirectional QRNN upsampling, leaving those out still works, but not nearly as well. It seems likely that a different conditioning network would remove the need for those "context" phonemes.
We have not yet used an encoder-decoder approach for duration or F0. Char2Wav has a bunch of interesting ideas, and it may be a direction for our future work. However, we do not plan on integrating the grapheme-to-phoneme model into our main model, because it's crucial that we can easily control the pronunciation of words with a phoneme dictionary; by having an explicit grapheme-to-phoneme step, we can easily set the pronunciation for unseen words (like "P!nk" or "Worcestershire"; an integrated grapheme-to-phoneme model would not be able to handle those, and even humans usually cannot!).
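As a rough illustration of why an explicit step helps, a dictionary lookup can sit in front of the learned model and override it per word. Everything below is hypothetical: the names, the stand-in fallback model, and the ARPAbet-style transcriptions are only illustrative.

```python
# Hand-written pronunciations that take priority over the learned model.
# The transcriptions are illustrative, not taken from any real lexicon.
PRONUNCIATION_OVERRIDES = {
    "p!nk": ["P", "IH1", "NG", "K"],
    "worcestershire": ["W", "UH1", "S", "T", "ER0", "SH", "ER0"],
}


def fallback_g2p(word):
    """Stand-in for a learned grapheme-to-phoneme model (hypothetical).

    It simply spells out the letters, which is wrong for real speech but
    enough to show where the learned model would be called.
    """
    return [ch.upper() for ch in word if ch.isalpha()]


def to_phonemes(word, overrides=PRONUNCIATION_OVERRIDES, g2p=fallback_g2p):
    """Use the dictionary entry if one exists, otherwise ask the model."""
    key = word.lower()
    if key in overrides:
        return overrides[key]
    return g2p(word)
```

Fixing a mispronounced word then means adding one dictionary entry, with no retraining.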
We have not yet worked with speaker global conditioning, but it is likely that the results from the WaveNet paper apply to our WaveNet implementation as well.
Finally, as for sampling, we have not seen much variation due to random seed for a fully converged model. However, our intuition for why sampling is important is that the speech distribution is (a) multimodal and (b) biased towards silence. If you are interested, you can gain a little bit of intuition about what the distribution actually looks like by just plotting a color map across time, with high-probability values being bright and low probability values being dark; it generates a pretty plot, and you can see that some areas are clearly stochastic (especially fricatives) and some areas are multimodal (vowel wave peaks).
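A toy example of one possible reading of that intuition: with a made-up two-mode distribution over 256 quantized amplitude bins, the mean lands on the low-probability silence bin, while direct sampling almost always picks one of the two modes. All numbers here are invented for illustration and do not come from a trained model.

```python
import random

NUM_BINS = 256
SILENCE_BIN = 128  # assumption: the middle bin represents zero amplitude

# Made-up output distribution for one timestep: two equally likely wave
# peaks (bins 40 and 216) plus a small amount of mass on silence.
probs = [0.0] * NUM_BINS
probs[40] = 0.45
probs[216] = 0.45
probs[SILENCE_BIN] = 0.10

# Deterministic choice: the expected bin index, rounded to the nearest bin.
mean_bin = round(sum(i * p for i, p in enumerate(probs)))

# Stochastic choice: draw bins in proportion to their probability.
rng = random.Random(0)
samples = rng.choices(range(NUM_BINS), weights=probs, k=1000)
non_silent = sum(1 for s in samples if s != SILENCE_BIN)
```

Here `mean_bin` equals the silence bin even though the distribution puts 90% of its mass on the peaks, whereas the sampled bins are overwhelmingly non-silent.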
It's hard to say! We don't quite know exactly how many parameters or minutes of audio are needed to describe fully someone's voice and speaking patterns. Maybe one or two, maybe much more.
I don't quite know what VoCo does, but it seems like a concatenative system that they've tuned a huge amount. I'm a little skeptical that it works as well and as reliably in real life as it does in demos. But even so, parametric models tend to be much smaller in size and more flexible, so there may be applications where WaveNet-style systems are applicable in ways concatenative systems can't handle (high quality on-device TTS, emotive TTS, speaker synthesis for new unheard speakers, etc).
A simpler problem could be to identify someone based on voice. Is that problem already solved? And can we use this to solve the problem of generating someone's voice?
That has been possible for years, and is even a typical student assignment in speech processing courses. A quick search gave this example course at Cornell
Afaik VoCo isn't creating anything from thin air, instead it scans the available voice data (it reportedly needs a sample of about 20 mins of a person speaking) and copies fragments of it in specific order to create a sentence.
Hi Andrew, congratulations on your result! A few questions, feel free to answer one or any. How close do you think you are to having fully end-to-end models for speech? Are you optimistic we can get speech synthesis to run on mobile devices in the near future? Do the inference optimizations (particularly sample embedding and layer inference) generalize well to other architectures, like speech recognition? It seems that if these models are going to run offline in realtime on mobile devices, we will need to have specialized hardware, but maybe we can squeeze enough performance out of mobile CPUs to get a highly optimized version to work. Thanks!
For fully end-to-end models, it's hard to say exactly. The Char2Wav paper demonstrates that there is hypothetically an architecture and a set of weights that can do synthesis end-to-end, but we cannot yet train such a system. On Reddit, one of the Char2Wav authors comments that they tried training it directly and didn't get great results, and at SVAIL we've also had some trouble doing so. I think it is very likely going to happen in the next several months or year, but we don't yet know exactly what needs to happen in order to get it to work.
As for inference, some of the inference optimizations do generalize. In fact, the GPU optimizations (persistent kernels) were originally developed by our systems team, and published in the Persistent RNN [0] paper. (This is a really powerful technique that CUDA makes very hard to implement, and I have a massive amount of respect for the folks who managed to make it work!) Persistent RNNs make training at close-to-peak-FLOPs with very low batch sizes plausible, and make GPU WaveNet inference plausible. At the moment, our CPU kernels are much more promising, but we don't know whether that will stay the case. For mobile, I think it is possible to get the current systems to work on fairly powerful mobile CPUs with a bunch more work into optimization and low-level assembly, but we haven't done it yet so time will tell.
>> Are you optimistic we can get speech synthesis to run on mobile devices in the near future?
You mean high quality, right? Speech synthesis that runs on cheap hardware and is understandable has been around for decades. Speech recognition has also been around for a long time, but there's a huge difference in usability between "pretty good recognition" and "pretty good synthesis". One is useful, the other not so much.
Is there an implementation of this to check out? It seems like you needed to write some custom, low-level code to implement this in real-time. Which libraries did you use to generate the ANNs and do the inferences?
We are not currently releasing any code, but hopefully the paper on arxiv is enough to make it easy to reproduce the result.
We use TensorFlow for writing and training the model and C++ with a lot of hand optimizations for inference, with assembly kernels written with PeachPy (which is an awesome piece of software!).
Baidu as a company doesn't use TensorFlow (as far as I know). We have our own high-performance and easy to use open source framework called PaddlePaddle [0], which is quite powerful and flexible.
However, the Baidu Research Silicon Valley AI Lab (SVAIL) allows researchers and research teams to use whatever frameworks they want to, and we have projects using TensorFlow, Torch, our own SVAIL-internal high-performance RNN framework, and PaddlePaddle. Using our own framework sometimes allows us to work on very high-performance implementations of various primitives and techniques that would be harder to do without complete control over the source code.
We didn't actually try LSTMs, because we train in 1.25 second chunks, so running an LSTM for several hundred timesteps would drastically slow down training. Our per-iteration time was in the 200-500 millisecond range, and using an LSTM or GRU would likely bump that into the 1-3 second range, maybe more, whereas the QRNN conditioning actually makes it cheaper than the transposed convolution conditioning by 20-40%.
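For context on the "several hundred timesteps" figure and why a QRNN is cheaper per step, here is a small sketch. It assumes the conditioning network runs at the 256 Hz phoneme-signal rate, and it follows the f-pooling formula from the published QRNN formulation rather than the CUDA implementation described here.

```python
CHUNK_SECONDS = 1.25
CONDITIONING_RATE_HZ = 256  # assumption: recurrent steps run at this rate
timesteps = int(CHUNK_SECONDS * CONDITIONING_RATE_HZ)  # steps per training chunk


def f_pooling(candidates, forget_gates, h0=0.0):
    """QRNN f-pooling for a single channel.

    h_t = f_t * h_{t-1} + (1 - f_t) * z_t

    In a QRNN the candidates z_t and gates f_t come from convolutions that
    are computed for all timesteps in parallel, so this cheap elementwise
    loop is the only sequential part. An LSTM instead needs a matrix
    multiply on the previous hidden state at every step.
    """
    h = h0
    states = []
    for z, f in zip(candidates, forget_gates):
        h = f * h + (1.0 - f) * z
        states.append(h)
    return states


states = f_pooling([1.0, 1.0], [0.5, 0.5])
```

Under that assumption, each 1.25 second chunk corresponds to 320 sequential steps.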
The upsampling procedure is quite finicky, so we had quite a few iterations there, but we didn't have to tune the hyperparameters of the QRNN itself too much. Once we implemented the QRNN in CUDA for TensorFlow and got it to train, it worked without too much trouble.
Our collaborators in Beijing mentioned that bidirectional LSTMs also worked in a similar way, though.
For those of us interested in this area of research what are the best papers and other resources for us to read? Has there been any success with deep approaches that do not have the WaveNet architecture?
Check out Char2Wav (recent) and SampleRNN (the RNN-based audio synthesis architecture). The related work section of the Deep Voice paper mentions a bunch of relevant papers!
We take several days (2-3) on 8 Titan X GPUs to train our models, which is quite a lot of compute. Running on mobile devices is quite challenging – the inference is not yet fast enough to support that, and has only been optimized for x86 AVX2 CPUs. It may be possible with a fair amount of future work!
Thanks, Albert