by gridlockd
2590 days ago
I haven't read what they're doing, but chances are they're using an adversarial neural network. The job of the adversarial network is to tell real from fake. The job of the generating network is to fool the adversarial network. Both are trained in tandem. One could imagine training a second adversarial network that isn't used to train the generator, and so would pick up on nuances that the original adversary misses. Anyone could do that; I don't think it's the author's responsibility. Somewhat related: https://keenlab.tencent.com/en/2019/03/29/Tencent-Keen-Secur...
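To make the "trained in tandem" part concrete, here is a toy sketch in plain Python. It is not what the authors do (I haven't read it), just the general adversarial loop on 1-D numbers. Everything here is an assumption for illustration: the name `train_toy_gan`, the one-parameter generator `fake = z + b`, the logistic discriminator `D(x) = sigmoid(w*x + c)`, and the choice of real data drawn from a Gaussian with mean 4.

```python
import math
import random


def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, lr=0.05, real_mean=4.0, seed=0):
    """Alternate discriminator and generator updates on 1-D samples.

    Hypothetical toy setup: real samples ~ N(real_mean, 1),
    fake samples = z + b with z ~ N(0, 1).
    """
    rng = random.Random(seed)
    b = 0.0          # generator parameter (shift applied to noise)
    w, c = 0.1, 0.0  # discriminator parameters: D(x) = sigmoid(w*x + c)

    for _ in range(steps):
        real = rng.gauss(real_mean, 1.0)
        fake = rng.gauss(0.0, 1.0) + b

        # Discriminator step: gradient ascent on
        # log D(real) + log(1 - D(fake)), i.e. tell real from fake.
        d_real = sigmoid(w * real + c)
        d_fake = sigmoid(w * fake + c)
        w += lr * ((1.0 - d_real) * real - d_fake * fake)
        c += lr * ((1.0 - d_real) - d_fake)

        # Generator step: gradient ascent on log D(fake),
        # i.e. move the fake sample toward what D calls "real".
        d_fake = sigmoid(w * fake + c)
        b += lr * (1.0 - d_fake) * w

    return b, w, c


if __name__ == "__main__":
    b, w, c = train_toy_gan()
    print("generator shift:", b, "discriminator:", (w, c))
```

The held-out adversary idea would correspond to freezing `b` after training and fitting a fresh `(w, c)` pair on real versus generated samples, without ever feeding its gradient back into the generator.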
Generative-adversarial models have had a lot of success in image generation; however, the same cannot be said for speech synthesis.
Unless they have figured out a new technique, they are probably using Tacotron 2 (https://ai.googleblog.com/2017/12/tacotron-2-generating-huma...). Google's Tacotron 2 already achieved near-human-parity TTS, as measured by mean opinion score (MOS), without adversarial training.