by albertzeyer 3393 days ago
Hi,

That is some very nice and interesting work! In fact, I have also worked on exactly the same thing, so I'm impressed by your accomplishments.

How much have you played around with different local conditioning features, i.e. the phoneme signal? Was it always at 256 Hz? Have you always used nearest-neighbor upsampling to 16 kHz? Have you always used those 2 + (1 + 2 + 2) * (40 + 5) = 227 dimensions? We tried with just 39-dimensional phonemes, which also worked, but the quality was not so nice and it sounded very robotic, probably due to the missing F0. We also only had 100 Hz, but we tried some variants to upsample it to 16 kHz, like linear interpolation, deconv, or combinations of them.
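To make the two simplest upsampling variants mentioned above concrete, here is a minimal pure-Python sketch. The function names and the toy frames are made up for illustration and are not taken from either implementation; for a 100 Hz signal the factor to reach 16 kHz would be 160.

```python
def upsample_nearest(frames, factor):
    """Repeat each conditioning frame `factor` times (nearest-neighbor)."""
    return [frame for frame in frames for _ in range(factor)]


def upsample_linear(frames, factor):
    """Linearly interpolate between consecutive frames; the last frame is held."""
    out = []
    for i, frame in enumerate(frames):
        nxt = frames[min(i + 1, len(frames) - 1)]
        for k in range(factor):
            t = k / factor
            out.append([(1 - t) * a + t * b for a, b in zip(frame, nxt)])
    return out


# Feature count quoted above: 2 + (1 + 2 + 2) * (40 + 5)
feature_dim = 2 + (1 + 2 + 2) * (40 + 5)

# Integer upsampling factor for a 100 Hz feature stream and 16 kHz audio.
factor_100hz = 16000 // 100

# Toy example (hypothetical values): 3 frames of 2-dim features, upsampled by 4.
frames = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
nearest = upsample_nearest(frames, 4)
linear = upsample_linear(frames, 4)
```

Nearest-neighbor produces a step function at frame boundaries, while linear interpolation ramps between frames; neither has any learned parameters.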

In the local conditioning network, you used QRNNs. Did you also try simpler methods, like just pure convolution? (And then the upsampling like you did, by nearest neighbor.)

You are predicting phone duration + F0. Have you also tried an encoder-decoder approach instead, like in Char2Wav? I.e. instead of predicting the duration, you let the decoder unroll it. Then, also like Char2Wav, you could combine that directly with your Grapheme-to-Phoneme model. Have you tried that?

Did you also try some global condition, like speaker identity?

We also tried all the sampling methods you are listing and observed the same behavior, i.e. only direct sampling really works. I tried many more deterministic variants (like taking the mean), but none of them worked. This is a bit strange. Also, the quality can vary depending on the random seed.

Thanks, Albert

1 comment

Feel free to get in touch for more Q&A; my email is in my profile.

We've experimented a bunch with many of these hyperparameters. Our phoneme signal has mostly stayed 256 Hz, but we've done a few experiments with lower-frequency signals that indicate it's probably possible to reduce it.

We have used many types of upsampling, and find that the upsampling and conditioning procedure does not affect the quality of the audio itself, but does affect the frequency of pronunciation mistakes. We used upsampling based on bicubic and bilinear interpolation, as well as transposed convolutions and a variety of other simpler convolutions (for example, per-channel transposed convolutions). These tend to work and converge, but then generate pronunciation mistakes on difficult phonemes. A full transposed convolution upsampling (two transposed convolution layers with stride 8 each) works almost as well as our bidirectional QRNNs, but it's much, much more expensive in terms of compute and parameters, and takes longer to train as well.
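As a rough illustration of the length arithmetic behind two stride-8 transposed convolutions, here is a single-channel, pure-Python sketch. The kernel size and the all-ones weights are assumptions chosen for readability; a real layer would learn its weights and operate on many channels.

```python
def transposed_conv1d(signal, kernel, stride):
    """1-D transposed convolution, single channel, no padding.

    Output length is (len(signal) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, x in enumerate(signal):
        for k, w in enumerate(kernel):
            out[i * stride + k] += x * w
    return out


STRIDE = 8
# Assumption: kernel size equal to the stride, so output windows never overlap.
# With all-ones weights this reduces to nearest-neighbor repetition.
kernel = [1.0] * STRIDE

frames = [0.5, -1.0, 2.0]  # hypothetical 256 Hz conditioning values
once = transposed_conv1d(frames, kernel, STRIDE)   # 8x upsampling
twice = transposed_conv1d(once, kernel, STRIDE)    # 64x upsampling in total

# 256 Hz * 8 * 8 = 16384 Hz, i.e. roughly 16 kHz.
output_rate = 256 * STRIDE * STRIDE
```

With learned weights and larger kernels the output windows overlap and blend neighboring frames, which is where the extra parameters and compute come from.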

As noted in the paper, we used many of the original features used for WaveNet before reducing our feature set. F0 is definitely important for proper intonation. We find that including the surrounding phonemes is quite important; with the bidirectional QRNN upsampling, leaving those out still works, but not nearly as well. It seems likely that a different conditioning network would remove the need for those "context" phonemes.

We have not yet used an encoder-decoder approach for duration or F0. Char2Wav has a bunch of interesting ideas, and it may be a direction for our future work. However, we do not plan on integrating the grapheme-to-phoneme model into our main model, because it's crucial that we can easily adjust the pronunciation of words with a phoneme dictionary. By having an explicit grapheme-to-phoneme step, we can easily set the pronunciation for unseen words like "P!nk" or "Worcestershire". An integrated grapheme-to-phoneme model would not be able to handle those; even humans usually cannot!

We have not yet worked with speaker global conditioning, but it is likely that the results from the WaveNet paper apply to our WaveNet implementation as well.

Finally, as for sampling, we have not seen much variation due to random seed for a fully converged model. However, our intuition for why sampling is important is that the speech distribution is (a) multimodal and (b) biased towards silence. If you are interested, you can gain a little bit of intuition about what the distribution actually looks like by plotting a color map across time, with high-probability values being bright and low-probability values being dark. It generates a pretty plot, and you can see that some areas are clearly stochastic (especially fricatives) and some areas are multimodal (vowel wave peaks).
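A toy sketch of that intuition, with made-up numbers: a single-timestep categorical distribution over 256 quantized amplitude bins that has two symmetric modes plus a larger mass at the silence bin. The bin positions and probabilities are purely hypothetical and not measured from any model.

```python
import random

SILENCE = 128  # assumed center bin of a 256-level quantization

# Hypothetical output distribution for one timestep: bin -> probability.
probs = {40: 0.3, SILENCE: 0.4, 216: 0.3}

# Deterministic decoding, variant 1: pick the most likely bin.
argmax_bin = max(probs, key=probs.get)

# Deterministic decoding, variant 2: take the expected bin value.
mean_bin = round(sum(b * p for b, p in probs.items()))

# Direct sampling from the distribution (seeded for reproducibility).
rng = random.Random(0)
bins = list(probs)
weights = [probs[b] for b in bins]
samples = rng.choices(bins, weights=weights, k=1000)
non_silent = sum(1 for s in samples if s != SILENCE)
```

In this toy case both deterministic variants land on the silence bin at every step, while direct sampling visits the two non-silent modes in most draws.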