by perfopt
1423 days ago
This is awesome. One question I have always had: is the research on applying DL to images the most developed compared to other domains? Even DL used for audio processing (classification, separation, etc.) seems to convert audio to spectrograms and apply DL to those. Changing a problem so that its inputs are expressed as images will be an advantage when using DL as a solution. Would you agree?
Take convolutional models, for example. They're very effective for working with images because they (a) are parameter efficient, (b) learn local/spatial correlations in input features, and (c) exploit translation invariance (strictly speaking, a convolution is translation-equivariant, and pooling is what makes the result approximately invariant). As an oversimplification, we can train models to visually identify "things" in images by their edges.
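To make points (a) and (c) concrete, here is a minimal NumPy sketch, not anything from a real model. The toy image, the Sobel-style kernel, and the helper `conv2d_valid` are all assumptions chosen for illustration. It shows that moving an object in the input simply moves the filter's response by the same amount, and that the filter needs far fewer weights than a dense layer producing the same output size.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation (what DL libraries call convolution)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical toy input: a 16x16 image containing one bright 4x4 square.
image = np.zeros((16, 16))
image[4:8, 4:8] = 1.0

# The same square moved 3 pixels down and 5 pixels right.
shifted = np.roll(image, shift=(3, 5), axis=(0, 1))

# Sobel-style vertical-edge detector: 9 weights reused at every position.
kernel = np.array([[1.0, 0.0, -1.0],
                   [2.0, 0.0, -2.0],
                   [1.0, 0.0, -1.0]])

response = conv2d_valid(image, kernel)
response_shifted = conv2d_valid(shifted, kernel)

# Equivariance: the response to the shifted image is the shifted response.
equivariant = np.allclose(
    np.roll(response, shift=(3, 5), axis=(0, 1)), response_shifted
)

# Parameter efficiency: one 3x3 kernel vs. a dense layer mapping
# every input pixel to every output position.
conv_params = kernel.size
dense_params = image.size * response.size

print(equivariant, conv_params, dense_params)
```

The comparison with a dense layer is deliberately crude (no biases, a single filter), but the gap in weight count is the point.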
If you think about what's going on with an audio spectrogram, you can see the same concepts at work. There's local/spatial correlation: certain sounds tend to have similar power coefficients in similar frequency buckets. These are also correlated in time (because the pitch envelope of the word "yes" tends to have the same shape), and convolutional models can exploit time-invariance as well (in the sense that they can learn the word "yes" from samples where the word appears with varying amounts of silence to the left and right).
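The same idea can be sketched for audio. This is a rough illustration under stated assumptions: a synthetic 1 kHz tone burst stands in for a spoken word, and the sample rate, frame length, hop size, and the helpers `spectrogram` and `clip_with_burst` are all made up for the example. The burst is placed in two clips with different amounts of leading silence, and the spectrograms come out as the same pattern shifted along the time axis.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[t * hop : t * hop + frame_len] * window
        for t in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8000  # Hz, assumed for this toy example
tone_hz = 1000      # stand-in for a "word": a short pure tone
burst = np.sin(2 * np.pi * tone_hz * np.arange(2048) / sample_rate)

def clip_with_burst(offset, total_len=8192):
    """Silence everywhere except for the burst starting at `offset` samples."""
    clip = np.zeros(total_len)
    clip[offset:offset + len(burst)] = burst
    return clip

hop = 128
early = spectrogram(clip_with_burst(offset=1024), hop=hop)
late = spectrogram(clip_with_burst(offset=1024 + 10 * hop), hop=hop)

# Frequency locality: energy concentrates in one frequency bucket.
# Each bin spans sample_rate / frame_len = 31.25 Hz.
peak_bin = int(np.argmax(early.sum(axis=1)))

# Time shift: the later clip's spectrogram is the earlier one moved 10 frames right.
same_pattern = np.allclose(early[:, :-10], late[:, 10:])

print(early.shape, peak_bin, same_pattern)
```

The match is exact here only because the extra silence is a whole number of hops. With an arbitrary offset the two patterns would be close but not identical, which is the kind of variation a convolutional model has to absorb.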
That being said, the addition of the time domain makes audio quite hard to work with, and (usually) not as simple as just running a spectrogram through a vanilla image classification model. But it's definitely enlightening to think about how these models are "learning".