
These are the "inputs" I'm talking about recovering (from the link):

"In order to use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, etc.) and by feeding it into WaveNet."

The raw audio from Step 3 was (in principle) generated by feeding that input into a properly trained WaveNet. We need to recover that input from the audio so we can feed it to the target WaveNet instead.

How a specific WaveNet instance is configured (as you point out, it's part of the model parameters) is an implementation detail that is irrelevant for the steps I proposed.
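
To make the data flow concrete, here is a toy sketch of the recover-then-transfer idea. Everything in it is a hypothetical stand-in: `make_toy_voice`, `recover_features`, and `transfer` are names I made up, the "voices" are trivial deterministic functions rather than trained WaveNets, and the per-frame search only works because the toy model is invertible frame by frame. A real WaveNet is autoregressive and would need a far more involved recovery step, so treat this as an illustration of the steps, not an implementation.

```python
from typing import Callable, List, Sequence

Features = List[int]   # hypothetical: one phoneme id per frame
Audio = List[float]    # hypothetical: one audio sample per frame
Voice = Callable[[Features], Audio]


def make_toy_voice(pitch: float) -> Voice:
    """Toy stand-in for a trained, conditioned model: features -> audio."""
    def synthesize(features: Features) -> Audio:
        # Each frame's sample depends only on that frame's feature.
        return [pitch * (f + 1) for f in features]
    return synthesize


def recover_features(audio: Audio, voice: Voice,
                     vocabulary: Sequence[int]) -> Features:
    """Recover the conditioning input that (in principle) produced `audio`.

    For each frame, pick the feature whose synthesized sample is closest
    to the observed sample. The voice is used as a black box.
    """
    recovered: Features = []
    for sample in audio:
        best = min(vocabulary, key=lambda f: abs(voice([f])[0] - sample))
        recovered.append(best)
    return recovered


def transfer(audio: Audio, source: Voice, target: Voice,
             vocabulary: Sequence[int]) -> Audio:
    """Recover the input from the source audio, then feed it to the target."""
    return target(recover_features(audio, source, vocabulary))


source_voice = make_toy_voice(pitch=1.0)
target_voice = make_toy_voice(pitch=2.5)
vocabulary = range(4)

original_features = [0, 3, 1, 2]
source_audio = source_voice(original_features)
converted_audio = transfer(source_audio, source_voice, target_voice, vocabulary)
```

Note that `transfer` only ever calls the two voices as black boxes, which is the point about configuration above: how each instance is set up internally never enters into the steps.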