HN Mirror


by erichocean 3571 days ago

This can be used to implement seamless voice performance transfer from one speaker to another:

1. Train a WaveNet with the source speaker.

2. Train a second WaveNet with the target speaker. Or for something totally new, train a WaveNet with a bunch of different speakers until you get one you like. This becomes the target WaveNet.

3. Record raw audio from the source speaker.

Fun fact: any algorithmic process that "renders" something given a set of inputs can be "run in reverse" to recover those inputs given the rendered output. In this case, we now have raw audio from the source speaker that, in principle, could have been rendered by the source speaker's WaveNet, and we want to recover the inputs that would have rendered it, had we done so.

To do that, you usually convert all numbers in the forward renderer into dual numbers and use automatic differentiation to recover the inputs (in this case, phonemes and whatnot).

4. Recover the inputs. (This is computationally expensive, but not difficult in practice, especially if WaveNet's generation algorithm is implemented in C++ and you've got a nice black-box optimizer to apply to the inputs, of which there are many freely available options.)

5. Take the recovered WaveNet inputs, feed them into the target speaker's WaveNet, and record the resulting audio.

Result: the raw audio will have the same overall performance and speech as the source speaker, but rendered completely naturally in the target speaker's voice.
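
The dual-number step above can be sketched on a toy problem. This is only an illustration under loud assumptions: `render` is a hypothetical one-input, one-output stand-in for a forward renderer (nothing like WaveNet), and plain gradient descent stands in for whatever optimizer you would actually use. It works here because the toy renderer is monotonic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """val + eps*e with e*e == 0; eps carries the derivative."""
    val: float
    eps: float = 0.0

    def __add__(self, other):
        other = lift(other)
        return Dual(self.val + other.val, self.eps + other.eps)

    def __mul__(self, other):
        other = lift(other)
        # The product rule falls out of (a + a'e)(b + b'e) with e*e == 0.
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)

    __radd__ = __add__
    __rmul__ = __mul__


def lift(x):
    """Promote a plain number to a Dual with zero derivative."""
    return x if isinstance(x, Dual) else Dual(float(x))


def render(x):
    """Hypothetical toy 'renderer'; accepts floats and Duals alike."""
    return x * x * x + 3.0 * x


def recover_input(target, x=0.0, lr=0.005, steps=500):
    """Gradient descent on (render(x) - target)**2 with dual-number derivatives."""
    for _ in range(steps):
        out = render(Dual(x, 1.0))      # out.val = render(x), out.eps = render'(x)
        grad = 2.0 * (out.val - target) * out.eps
        x -= lr * grad
    return x


rendered = render(1.5)                  # pretend this is the recorded audio
recovered = recover_input(rendered)
print(recovered)                        # close to 1.5
```

Seeding the input with `eps = 1.0` is what makes the renderer return its own derivative alongside its value; with many inputs you would need one pass per input, or reverse-mode differentiation instead.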

8 comments

itcrowd 3571 days ago

Another fun fact: this actually happens with (cell) phone calls.

You don't send your speech over the line; instead, you send some parameters over the line, which are then fed into a white(-ish) noise generator at the receiving end to recover the speech.

Edit: not by using a neural net or deep learning, of course.

JonnieCache 3571 days ago

In case anyone is wondering, the technique is called linear predictive coding.
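
A rough sketch of the idea, not of any real phone codec: fit a small all-pole predictor to a signal, keep only its coefficients and a gain, and resynthesise by driving the filter with white noise. The function names, the order-2 filter and the single long frame are assumptions made for illustration; as far as I know, real codecs work on short frames and also transmit pitch and voicing information.

```python
import math
import random


def autocorr(x, max_lag):
    """Raw autocorrelation r[0..max_lag] of one frame."""
    return [sum(x[i] * x[i - lag] for i in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def lpc_encode(frame, order):
    """Levinson-Durbin: fit x[n] ~ sum(a[k] * x[n-k]) and return (a, gain)."""
    r = autocorr(frame, order)
    a = [0.0] * (order + 1)              # a[0] is unused
    err = r[0]                           # prediction error energy so far
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    gain = math.sqrt(err / len(frame))   # RMS of what the predictor misses
    return a[1:], gain


def lpc_decode(coeffs, gain, n, rng):
    """Drive the all-pole filter with white noise to resynthesise n samples."""
    out = []
    for _ in range(n):
        past = sum(c * out[-k] for k, c in enumerate(coeffs, start=1)
                   if k <= len(out))
        out.append(past + gain * rng.gauss(0.0, 1.0))
    return out


rng = random.Random(0)
true_coeffs = [1.3, -0.6]                # hypothetical "vocal tract" filter
signal = lpc_decode(true_coeffs, 1.0, 20000, rng)   # stand-in for speech

coeffs, gain = lpc_encode(signal, order=2)
print(len(signal), "samples ->", len(coeffs) + 1, "parameters")
resynth = lpc_decode(coeffs, gain, 20000, rng)
```

The resynthesised signal is not sample-for-sample identical to the original; it only shares the same spectral envelope, which is the part the receiver needs to make the speech intelligible.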

AstralStorm 3571 days ago

Predictive coding. Linear is a specific variant of it used in older codecs.

VikingCoder 3570 days ago

What's the difference in bandwidth?

svantana 3571 days ago

> Fun fact: any algorithmic process that "renders" something given a set of inputs can be "run in reverse"

Now wait a minute, most algorithms cannot be run in reverse! The only general way to reverse an algo is to try all possible inputs, which has exponential complexity. That hardness is the basis of RSA encryption. Maybe you're thinking about automatic differentiation, a general algo to get the gradient of the output w.r.t. the inputs. That allows you to search for a matching input using gradient descent, but that won't give you an exact match for most interesting cases (due to local minima).

I'm not trying to nitpick -- in fact I believe that IF algos were reversible then human-level AI would have been solved a long time ago. Just write a generative function that is capable of outputting all possible outputs, reverse, and inference is solved.
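
The local-minimum problem is easy to reproduce on a toy case. In this sketch `render` is a hypothetical non-injective function chosen for illustration, with a hand-written derivative; the step size and starting points are likewise arbitrary assumptions.

```python
def render(x):
    """Hypothetical non-injective 'renderer'."""
    return x ** 3 - 3.0 * x


def render_grad(x):
    """Derivative of render, written out by hand."""
    return 3.0 * x ** 2 - 3.0


def loss(x, target):
    """Squared mismatch between the rendering of x and the target output."""
    return (render(x) - target) ** 2


def descend(x, target, lr=0.001, steps=2000):
    """Plain gradient descent on the squared mismatch."""
    for _ in range(steps):
        x -= lr * 2.0 * (render(x) - target) * render_grad(x)
    return x


target = render(2.5)                 # the output we want to explain
good = descend(2.0, target)          # starts in the right basin
stuck = descend(-2.0, target)        # starts in the wrong one

print(good, loss(good, target))      # near 2.5, loss near 0
print(stuck, loss(stuck, target))    # near -1.0, loss near 37.5
```

The second run stops where the renderer's own derivative vanishes, so the gradient of the mismatch is zero even though the output is nowhere near the target.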

shoo 3571 days ago

This also makes me think of "inverse problems", in the context of mathematics, physics.

E.g. a forward problem might be to solve some PDE to simulate the state of a system from some known initial conditions.

The inverse problem could be to try to reverse engineer what the initial conditions were given the observed state of the system.

Inverse problems are typically much harder to deal with, and much harder to solve. E.g. perhaps they don't have a unique solution, or the solution is a highly discontinuous function of the inputs, which amplifies any measurement errors. In practice this can be addressed by regularisation, i.e. introducing strong structural assumptions about what the expected solution should look like. This can be quite reasonable from a Bayesian perspective.

https://en.wikipedia.org/wiki/Inverse_problem#Mathematical_c...
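
A tiny numerical illustration of both points, error amplification and regularisation. The 2x2 forward model, the noise level and the penalty weight `lam` are all made up for the example, and the regulariser used is the Tikhonov (ridge) penalty, which encodes the assumption that small-norm solutions are preferred. That assumption happens to suit the true state here; with a different truth it could just as well mislead.

```python
def solve2(m, v):
    """Solve a 2x2 linear system m @ x = v by Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [(v[0] * d - b * v[1]) / det, (a * v[1] - c * v[0]) / det]


def tikhonov2(m, v, lam):
    """Minimise |m x - v|^2 + lam * |x|^2 via (m^T m + lam I) x = m^T v."""
    (a, b), (c, d) = m
    mtm = [[a * a + c * c + lam, a * b + c * d],
           [a * b + c * d, b * b + d * d + lam]]
    mtv = [a * v[0] + c * v[1], b * v[0] + d * v[1]]
    return solve2(mtm, mtv)


forward = [[1.0, 1.0], [1.0, 1.0001]]    # nearly singular forward model
clean = [2.0, 2.0001]                    # forward applied to the true state [1, 1]
noisy = [2.0, 2.0002]                    # same observation, off by 1e-4

print(solve2(forward, clean))            # roughly [1, 1]
print(solve2(forward, noisy))            # roughly [0, 2]
print(tikhonov2(forward, noisy, 1e-3))   # roughly [1, 1] again
```

A measurement error of 1e-4 moves the naive answer by a full unit in each component, while the regularised answer barely moves.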

romaniv 3571 days ago

Maybe I'm reading this paper incorrectly, but it seems that in this system "voice" is part of the model parameters not inputs. What they did was train the same model with multiple reader voices while using one of the inputs to keep track of which voice the model was currently trained on. So the model can switch between different voices, but only between those which it was trained on.

"The conditioning was applied by feeding the speaker ID to the model in the form of a one-hot vector. The dataset consisted of 44 hours of data from 109 different speakers."

Am I missing something?
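
For readers unfamiliar with the term, a one-hot speaker ID looks like the sketch below. The helper name is made up; only the count of 109 speakers comes from the quoted passage. It also shows why a voice outside the training set has no slot to select.

```python
NUM_SPEAKERS = 109                       # from the quoted dataset description


def one_hot(speaker_id, num_speakers=NUM_SPEAKERS):
    """Vector of zeros with a single 1.0 at the speaker's index."""
    if not 0 <= speaker_id < num_speakers:
        raise ValueError("speaker was not in the training set")
    return [1.0 if i == speaker_id else 0.0 for i in range(num_speakers)]


conditioning = one_hot(42)
print(len(conditioning), sum(conditioning), conditioning.index(1.0))
```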

erichocean 3571 days ago

These are the "inputs" I'm talking about recovering (from the link):

"In order to use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, etc.) and by feeding it into WaveNet."

The raw audio from Step 3 was (in principle) generated by that input on a properly trained WaveNet. We need to recover that so we can transfer it to the target WaveNet.

How a specific WaveNet instance is configured (as you point out, it's part of the model parameters) is an implementation detail that is irrelevant for the steps I proposed.

swsieber 3571 days ago

Oh, pair this with facial mapping[1] and you pretty much have an "impersonate any famous person" system.

[1] http://www.graphics.stanford.edu/~niessner/thies2016face.htm...

erichocean 3571 days ago

Yup, I work in virtual filmmaking and there are tons of ways to use this stuff.

I give us 10-15 years before it's not possible to trust anything you see or hear that's recorded.

StavrosK 3571 days ago

Really? I haven't trusted anything recorded in years.

copperx 3570 days ago

Speech production is incredibly hard to fake at the moment.

sangnoir 3569 days ago

> Speech production is incredibly hard to fake at the moment.

Sound-alikes have been used in the music industry since forever.

mirimir 3571 days ago

Or transmitted from one place to another :(

zardo 3571 days ago

Basically the same idea as style transfer with image algorithms. Looking forward to Abraham Lincoln reading audiobooks to me.

infinite8s 3571 days ago

That would require audio recordings of Abraham Lincoln's voice. Not sure recording technology existed back then.

zardo 3571 days ago

Audio quality does leave something to be desired. https://vimeo.com/47987691

barrkel 3571 days ago

Lincoln died before Edison invented the phonograph. That's a hoax.

Houshalter 3571 days ago

Lincoln died in 1865, but the oldest recordings are from the 1860s. The video is definitely a hoax (http://www.firstsounds.org/research/others/lincoln.php), but it's at least theoretically possible his voice could have been recorded. In fact I believe there are some even older recordings from the 1850s, but I don't think those have been successfully recovered yet.

These early recordings are incredibly crude, and they did not have the technology at the time to play them back. They were just experiments in trying to view sound waves, not attempts to preserve information for future generations.

infinite8s 3571 days ago

Ah I stand corrected, thanks.

dhammack 3571 days ago

It seems like you're using WaveNet to do speech-to-text when we have better tools for that. To transfer text from Trump to Clinton, first run speech-to-text on Trump speech and then give that to a WaveNet trained on Clinton to generate speech that sounds like her but says the same thing as Trump.

erichocean 3571 days ago

> It seems like you're using WaveNet to do speech-to-text

I'm proposing reducing a vocal performance into the corresponding WaveNet input. At no point in that process is the actual "text" recovered, and doing so would defeat the whole purpose, since I don't care about the text, I care about the performance of speaking the text (whatever it was).

In your example, I can't force Trump to say something in particular. But I can force myself, so I could record myself saying something I wanted Clinton to say [Step 3] (and in a particular way, too!), and if I had a trained WaveNet for myself and Clinton, I could make it seem like Clinton actually said it.

dhammack 3571 days ago

I see. I still think it's easier to apply DeepMind's feature transform on text rather than to try to invert a neural network. Armed with a network trained on Trump, DeepMind's feature transform from text->network inputs, you should be able to make him say whatever you want, right?

Text -> features -> TrumpWaveNet -> Trump saying your text

erichocean 3571 days ago

> Armed with a network trained on Trump, DeepMind's feature transform from text->network inputs, you should be able to make him say whatever you want, right?

Yes, that should work, and by tweaking the WaveNet input appropriately, you could also get him to say it in a particular way.

creshal 3571 days ago

Sounds like a very fancy way to do compression with a massive custom dictionary.

posterboy 3571 days ago

Thanks for the tl;dr. However, the fun fact is not true for surjective functions, IIRC, in which case multiple inputs may relate to one output, if this is relevant for WaveNets.

mdup 3571 days ago

Nitpicking: surjective functions do not relate to uniqueness of outputs; you'd rather talk about non-injective functions. I agree with your point, though.

(surjective != non-injective, in the same way that non-increasing != decreasing)
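
The distinction can be checked mechanically on small finite sets. The helper names, domain and codomains below are chosen purely for illustration.

```python
def is_injective(f, domain):
    """No two inputs share an output, so the output pins down the input."""
    outputs = [f(x) for x in domain]
    return len(set(outputs)) == len(outputs)


def is_surjective(f, domain, codomain):
    """Every element of the codomain is hit by at least one input."""
    return set(codomain) <= {f(x) for x in domain}


def square(x):
    return x * x


def double(x):
    return 2 * x


domain = range(-2, 3)                    # -2, -1, 0, 1, 2

# Surjective onto {0, 1, 4} but not injective: 4 comes from both -2 and 2.
print(is_surjective(square, domain, {0, 1, 4}), is_injective(square, domain))
# Injective but not surjective onto -4..4: odd values are never produced.
print(is_surjective(double, domain, range(-4, 5)), is_injective(double, domain))
```

Only the failure of injectivity blocks recovering the input from the output; whether the function is surjective has no bearing on it.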