This can be used to implement seamless voice performance transfer from one speaker to another:

1. Train a WaveNet with the source speaker.

2. Train a second WaveNet with the target speaker. Or, for something totally new, train a WaveNet with a bunch of different speakers until you get one you like. This becomes the target WaveNet.

3. Record raw audio from the source speaker. Fun fact: an algorithmic process that "renders" something from a set of inputs can, at least in principle, be "run in reverse" to search for inputs that reproduce a given rendered output. In this case, we now have raw audio from the source speaker that, in principle, could have been rendered by the source speaker's WaveNet, and we want to recover the inputs that would have rendered it, had we done so. To do that, you usually convert all numbers in the forward renderer into dual numbers, use automatic differentiation to get the gradient of the mismatch with respect to the inputs (in this case, phonemes and whatnot), and then adjust the inputs until the rendered output matches the recording.

4. Recover the inputs. (This is computationally expensive, but not difficult in practice, especially if WaveNet's generation algorithm is implemented in C++ and you've got a nice black-box optimizer to apply to the inputs. Many such optimizers are freely available.)

5. Take the recovered WaveNet inputs, feed them into the target speaker's WaveNet, and record the resulting audio.

Result: the resulting raw audio will have the same overall performance and speech as the source speaker, but rendered completely naturally in the target speaker's voice.
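To make the "run the renderer in reverse" step a bit more concrete, here is a minimal sketch of dual numbers plus gradient descent recovering one input of a toy renderer. Everything in it is an assumption for illustration: the `Dual` class, the one-pole filter standing in for the renderer, and the names `render`, `loss`, and `recover_input` are all hypothetical and have nothing to do with WaveNet's actual implementation. A real network has many more inputs, and the search would likely be much harder than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Dual number val + eps*e with e*e = 0; eps carries the derivative."""
    val: float
    eps: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.eps + other.eps)

    __radd__ = __add__

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.val - other.val, self.eps - other.eps)

    def __rsub__(self, other):
        return _lift(other) - self

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (a + a'e)(b + b'e) = ab + (ab' + a'b)e
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)

    __rmul__ = __mul__


def _lift(x):
    """Promote a plain number to a Dual with zero derivative."""
    return x if isinstance(x, Dual) else Dual(float(x))


def render(a, excitation):
    """Toy stand-in for a renderer: one-pole filter y[n] = a*y[n-1] + x[n]."""
    out, prev = [], 0.0
    for x in excitation:
        prev = a * prev + x
        out.append(prev)
    return out


def loss(a, excitation, target):
    """Squared error between the rendered output and the recorded target."""
    total = 0.0
    for y, t in zip(render(a, excitation), target):
        d = y - t
        total = total + d * d
    return total


def recover_input(excitation, target, a0=0.0, lr=0.05, steps=500):
    """Gradient descent on the input, using dual numbers for the derivative."""
    a = a0
    for _ in range(steps):
        grad = loss(Dual(a, 1.0), excitation, target).eps  # d(loss)/da
        a -= lr * grad
    return a


excitation = [1.0, 0.0, 0.0, 0.0]       # an impulse
recording = render(0.5, excitation)     # "recorded" with a hidden input of 0.5
recovered = recover_input(excitation, recording)
```

The forward code in `render` is written once and works unchanged on plain floats and on `Dual` values, which is the appeal of the dual-number trick. The step size and iteration count are hand-picked for this toy problem.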
You don't send your speech over the line; instead, you send some parameters, which are then fed into a white(-ish) noise generator at the receiving end to recover the speech.
Edit: not by using a neural net or deep learning, of course.
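A rough sketch of that parameters-plus-noise idea, with no neural net involved. This assumes the simplest possible model, first-order linear prediction, where each frame is reduced to a filter coefficient and a gain. The function names `encode_frame` and `decode_frame` are made up for illustration, and practical speech coders of this kind use higher-order filters and more parameters per frame.

```python
import math
import random


def encode_frame(frame):
    """Sender: reduce a frame of samples to two parameters (a, gain)."""
    r0 = sum(s * s for s in frame)                                   # lag-0 autocorrelation
    r1 = sum(frame[i] * frame[i - 1] for i in range(1, len(frame)))  # lag-1 autocorrelation
    a = r1 / r0 if r0 > 0 else 0.0                                   # first-order predictor
    residual_energy = max(r0 * (1.0 - a * a), 0.0)
    gain = math.sqrt(residual_energy / len(frame))                   # noise level to reinject
    return a, gain


def decode_frame(a, gain, n, rng):
    """Receiver: shape white noise with the filter y[n] = a*y[n-1] + gain*noise."""
    out, prev = [], 0.0
    for _ in range(n):
        prev = a * prev + gain * rng.gauss(0.0, 1.0)
        out.append(prev)
    return out


rng = random.Random(0)
original = decode_frame(0.8, 1.0, 4000, rng)    # stand-in for a frame of "speech"
params = encode_frame(original)                 # only these two numbers are sent
rebuilt = decode_frame(params[0], params[1], len(original), rng)
```

The rebuilt frame is not sample-for-sample identical to the original. It only shares the same spectral shape and level, which is the whole trade: 4000 samples in, two numbers over the line.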