| I'm not an expert on machine learning or DSP, but I do know just enough of each to suspect this isn't anywhere near as impressive as it seems. A distortion pedal is essentially just a waveshaper [1]. Think of audio in digital terms as a series of numbers. A waveshaper is a simple mathematical function. To apply it, you literally just apply the function to each value in the input stream and there's your output stream. There's no memory and no interesting algorithm involved. It's the audio equivalent of calling map() on your list of samples with some lambda to produce a new list of samples. Of course, distortion pedals do that in the analogue domain using circuitry, which adds some complexity because transistors and diodes and friends don't behave exactly like mathematical functions. There's "sag" and some other physical effects that cause the output to also depend somewhat on previous input. Even so, that can generally be modelled using a simple convolution. Each output sample is calculated by taking some finite number of previous input samples, multiplying each of them by a weight factor, and then summing the results. Does that sound like a neural net? It is. That's why we call them convolutional neural networks. Convolution is bread and butter in DSP. You can easily generate one that produces the same effect as some piece of hardware or acoustic environment by running an impulse (a single 1.0 sample surrounded by silence) through the system and then recording the result. That "impulse response" is essentially your set of convolution weights. So using a deep neural network and then training it sounds a lot like overkill to me. You could accomplish much the same by using a "depth-1 network" and running an impulse through it. Caveat, though: I am just a novice here, so there could very well be a lot of subtlety I'm missing. [1]: https://en.wikipedia.org/wiki/Waveshaper |
An impulse response will characterize only a system that is
* linear
* time-invariant
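Many effects are not linear (especially distortion: the crunchiness comes from the nonlinearity). For a nonlinear f, f(a) + f(b) != f(a+b) in general, so superposition fails and a single impulse response cannot describe the effect.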
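To make the linearity point concrete, here is a minimal sketch in plain Python. The `waveshaper` function (a tanh soft clipper with a made-up `drive` value) is only a stand-in for a distortion pedal, not a model of any real circuit, and `convolve` is a hand-rolled helper. It "measures" an impulse response, uses it as convolution weights, and compares the result with what the waveshaper really does to a sine wave.

```python
import math

def waveshaper(x, drive=5.0):
    # Hypothetical soft-clipping distortion: memoryless, applied per sample.
    return math.tanh(drive * x)

def convolve(signal, kernel):
    # Direct-form FIR convolution, output truncated to len(signal).
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if n - k >= 0:
                acc += w * signal[n - k]
        out.append(acc)
    return out

# "Measure" the impulse response by pushing a unit impulse through the effect.
impulse = [1.0] + [0.0] * 7
impulse_response = [waveshaper(s) for s in impulse]

# A full-scale sine wave: the real effect clips it, the convolution model cannot.
signal = [math.sin(2 * math.pi * n / 16) for n in range(32)]
actual = [waveshaper(s) for s in signal]
modelled = convolve(signal, impulse_response)

# Largest per-sample disagreement between the effect and its "impulse response model".
max_error = max(abs(a - m) for a, m in zip(actual, modelled))

# Superposition check: f(a) + f(b) versus f(a + b).
a, b = 0.5, 0.5
superposition_gap = abs(waveshaper(a) + waveshaper(b) - waveshaper(a + b))

print(f"max model error: {max_error:.3f}")
print(f"superposition gap: {superposition_gap:.3f}")
```

Because the waveshaper has no memory, its impulse response is a single nonzero sample, so the convolution model collapses to a plain gain and misses the clipping entirely.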
And many effects are time-varying, for example phasers and choruses, which have low-frequency oscillators controlling how the sound is shaped depending on when it comes in. A chorus, for example, will vary the pitch up and down.
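The time-variance problem can be sketched the same way. The `chorus` below is a deliberately crude toy, assuming an integer delay swept by a sine LFO (a real chorus would interpolate fractional delays); the parameter names and values are made up for illustration. The idea is to feed in the same impulse at two different times and check whether the second response is simply a shifted copy of the first, as it would be for a time-invariant system.

```python
import math

def chorus(signal, base_delay=4, depth=3, lfo_period=16):
    # Toy chorus: mix the dry signal with a copy whose delay is swept by an LFO.
    out = []
    for n, x in enumerate(signal):
        lfo = math.sin(2 * math.pi * n / lfo_period)
        delay = base_delay + round(depth * lfo)  # delay depends on absolute time n
        delayed = signal[n - delay] if n - delay >= 0 else 0.0
        out.append(0.5 * (x + delayed))
    return out

def impulse_at(position, length=32):
    # A single 1.0 sample surrounded by silence.
    return [1.0 if n == position else 0.0 for n in range(length)]

shift = 4
early = chorus(impulse_at(0))
late = chorus(impulse_at(shift))

# For a time-invariant system, shifting the input only shifts the output.
late_realigned = late[shift:] + [0.0] * shift
time_invariant = early == late_realigned

print("early echo positions:", [n for n, v in enumerate(early) if v != 0.0])
print("late echo positions: ", [n for n, v in enumerate(late_realigned) if v != 0.0])
print("time-invariant:", time_invariant)
```

With these parameters the delayed copy lands at a different offset depending on when the impulse arrives, so there is no single impulse response to record.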