I hesitate to even post this, but I listened to the audio examples and it seems like this project was not yet a success. I'm not trying to be a jerk or snarky, but the reconstructed audio sounded terrible.
I have to agree. There's certainly more high-frequency content, but it seems mostly like noise, with only a vague amplitude correlation to the existing audio.
I'd be curious to see if any better results could be obtained by applying a similar technique in the frequency domain.
Opus does something similar in the frequency domain, but much simpler. It just copies codewords (minus energy) from lower bands to higher bands. It also lets you signal the energy of the copied bands (something you wouldn't have access to if reconstructing blindly).
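For illustration only, here is a toy Python sketch of the band-copying idea described above. It is not Opus's actual algorithm or bitstream. The function name `fold_low_band_up`, the `cutoff_bin` parameter and the `target_energy` value are all hypothetical, standing in for "copy the low band's shape, drop its energy, apply the signalled energy".

```python
import numpy as np

def fold_low_band_up(spectrum, cutoff_bin, target_energy):
    """Fill bins at and above cutoff_bin with a rescaled copy of the low band.

    Toy illustration only; not how Opus actually encodes or decodes audio.
    """
    out = spectrum.copy()
    low = spectrum[:cutoff_bin]
    n_high = len(spectrum) - cutoff_bin
    # Repeat the low band as many times as needed to cover the high band.
    reps = -(-n_high // cutoff_bin)  # ceiling division
    shape = np.tile(low, reps)[:n_high]
    # Discard the copied band's own energy (unit norm), then apply the
    # energy that a real codec would signal in the bitstream.
    norm = np.linalg.norm(shape)
    if norm > 0:
        shape = shape / norm * np.sqrt(target_energy)
    out[cutoff_bin:] = shape
    return out

# Hypothetical frame: random noise with everything above bin 8 removed.
rng = np.random.default_rng(0)
spectrum = np.fft.rfft(rng.standard_normal(64))
cutoff_bin = 8
spectrum[cutoff_bin:] = 0.0

filled = fold_low_band_up(spectrum, cutoff_bin, target_energy=4.0)
high_energy = float(np.sum(np.abs(filled[cutoff_bin:]) ** 2))
print(f"energy in the filled high band: {high_energy:.3f}")
```

A blind reconstructor has no `target_energy` to read, which is the point made above: it would have to guess that number as well.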
"applying a similar technique in the frequency domain", "Maybe training an image reconstructor on the short term spectrogram" - This is what I originally thought to do. However, this approach suffers from information loss whenever you transform from the frequency domain back to the time domain. Since the goal was super-resolution in the time domain, working in the time domain is more sensible.
Mathematically, the DFT is invertible, i.e. lossless, but in practice there will be a little loss due to the finite precision of floating-point numbers. Even so, that loss should be minuscule compared to the 16 kHz -> 2 kHz loss you are trying to overcome.
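This is easy to check numerically. A minimal sketch, assuming NumPy's FFT in double precision and a random test signal rather than real audio (the function name and the signal length are just for illustration):

```python
import numpy as np

def dft_round_trip_error(num_samples=16000, seed=0):
    """Return the largest absolute error after a forward and inverse DFT."""
    rng = np.random.default_rng(seed)
    signal = rng.standard_normal(num_samples)
    spectrum = np.fft.rfft(signal)
    # Pass n explicitly so the inverse returns the original length.
    recovered = np.fft.irfft(spectrum, n=num_samples)
    return float(np.max(np.abs(signal - recovered)))

round_trip_error = dft_round_trip_error()
print(f"max absolute round-trip error: {round_trip_error:.3e}")
```

The error comes out many orders of magnitude below the signal level, far smaller than anything lost by discarding the upper frequencies.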
It's not precision loss; it's that when you take a DFT you have to choose an interval. If you choose a short interval you are less certain about frequencies, while if you choose a long interval you are less certain about time-domain changes (i.e., changes in the signal within that interval).
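The trade-off can be put in numbers. A minimal sketch, assuming a plain windowed DFT with no overlap or zero padding; the function name and the 16 kHz sample rate are only for illustration:

```python
def dft_window_resolution(sample_rate_hz, window_samples):
    """Return (frequency bin spacing in Hz, window duration in seconds)."""
    bin_spacing_hz = sample_rate_hz / window_samples
    window_duration_s = window_samples / sample_rate_hz
    return bin_spacing_hz, window_duration_s

# A short window localises events in time but has coarse frequency bins.
short_window = dft_window_resolution(16000, 256)
# A long window has fine frequency bins but smears events over more time.
long_window = dft_window_resolution(16000, 4096)

print(short_window, long_window)
```

The two quantities are reciprocals of each other, so tightening one always loosens the other.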
When using DL, perhaps you could try a downsampling scheme that suits your DL model?
I mean, yes, it would be awesome to use your network to upsample arbitrary audio, but that is apparently hard. What about upsampling from something DL-friendly, and making the challenge how small you can get the downsampled size?
Since time domain content is the reconstruction target, wouldn't LSTMs be a better choice than CNNs? I would think the spectral content would be time variant and depend on the sequential history.
My thinking is this is a good GAN problem. L2 norm will have these bad trivial upscalings as local minima - since L2 in time domain is the same as L2 in frequency domain, you can think in the frequency domain that it basically has this big black area to infill from very little information. If you had some sort of perceptual similarity, on the other hand, there will be lots of adjacent improvements in quality that will reduce the error and make it easier to train. I think this matches the results seen in image upscaling, too.
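The claim that L2 in the time domain equals L2 in the frequency domain is Parseval's theorem, and it can be checked numerically. A minimal sketch, assuming NumPy and a unitary ("ortho") DFT scaling; the two random signals are stand-ins for a target waveform and a reconstruction:

```python
import numpy as np

def l2_time_and_frequency(target, estimate):
    """Return the squared L2 error computed in time and in frequency."""
    diff = target - estimate
    time_error = float(np.sum(diff ** 2))
    # norm="ortho" makes the DFT unitary, so energy is preserved exactly.
    freq_error = float(np.sum(np.abs(np.fft.fft(diff, norm="ortho")) ** 2))
    return time_error, freq_error

rng = np.random.default_rng(0)
target = rng.standard_normal(1024)
estimate = rng.standard_normal(1024)
time_error, freq_error = l2_time_and_frequency(target, estimate)
print(time_error, freq_error)
```

Since the two numbers agree, an L2 loss weights every frequency bin equally and does nothing to favour errors that are less audible.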
In fact, when you listen to the downsampled example, there is actually a lot of information in the extract. Way more than enough. That's because frequency should be considered on a log scale to be relevant to the human ear.
Here the frequency cutoff is 2 kHz, which is already a fairly high pitch.
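To make the log-scale point concrete, here is a small sketch that counts octaves. It assumes a nominal hearing range of 20 Hz to 20 kHz, which is an assumption of this example and not a figure from the post; the function name is hypothetical.

```python
import math

def octave_fraction_below(cutoff_hz, low_hz=20.0, high_hz=20000.0):
    """Fraction of the octaves between low_hz and high_hz that lie below cutoff_hz."""
    octaves_kept = math.log2(cutoff_hz / low_hz)
    octaves_total = math.log2(high_hz / low_hz)
    return octaves_kept / octaves_total

linear_fraction = 2000.0 / 20000.0
log_fraction = octave_fraction_below(2000.0)
print(f"linear: {linear_fraction:.2f}, log scale: {log_fraction:.2f}")
```

Under that assumption, a 2 kHz cutoff keeps only a tenth of the range on a linear axis but about two thirds of it when counted in octaves.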
I completely agree. You could even tell from the picture of the spectrum above that the "reconstruction" was not a success. The spectrum looks like it's been reconstructed by a flat extrapolation of the amplitude of the last frequency known.
This is exactly the kind of project where deep models should excel. Something's not working properly here.
"Something's not working properly here" - I disagree. The model will overtrain (i.e. perfectly reconstruct the original waveforms of a small training set), which indicates it's capable of learning the necessary transformation. The problem lies in the limited amount of training time I had. To reiterate from an earlier comment, I trained for only 10 epochs, while the paper this is based on claimed to train for 400. Much more training is required for this model to generalize well without degrading the signal-to-noise ratio.
Hey thanks for the comment. That makes sense, I'd be interested in hearing the results with more training. This could work well and have a good range of applications.
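For anyone who wants to compare runs themselves, the signal-to-noise ratio mentioned in this thread can be computed along these lines. This is a sketch using the common definition (reference energy over error energy, in decibels), which is assumed here and may differ from the exact metric used in the post; the function name and the test signal are hypothetical.

```python
import numpy as np

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB of an estimate against a reference waveform."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    noise = reference - estimate
    signal_energy = np.sum(reference ** 2)
    noise_energy = np.sum(noise ** 2)
    return float(10.0 * np.log10(signal_energy / noise_energy))

# Hypothetical check: an estimate that is 10% too quiet everywhere.
reference = np.array([1.0, -1.0, 1.0, -1.0])
estimate = 0.9 * reference
example_snr = snr_db(reference, estimate)
print(f"{example_snr:.2f} dB")
```

Higher is better, so a reconstruction that scores below the plain downsampled waveform has added more error than it removed.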
"the reconstructed audio sounded terrible" - I think this is referring to the amount of static noise in the reconstructed waveform. Indeed, the SNR clearly shows the reconstruction is slightly worse than the downsampled waveform. As mentioned in the post, I strongly believe this is due to the limited amount of training I performed. The number of epochs of training data in my case was only 10 while the paper this project is based on trained for 400 epochs. During training I noticed a strong dependence on training epochs and perceptual performance.
My ears think so too, but upsampling by just 2 is roughly the same difficulty as upsampling an image by 2. As you probably know, you can't just CIA-like "ENHANCE" an image to double its resolution and expect its noise level to be lower than, I don't know, 10 decibels (of image brightness). Yet our ears can notice noise as low as 40-50 decibels, so it would be nearly impossible to reconstruct higher frequencies so that the result has no noticeable noise.
In this research, the author is attempting to upsample by a bit more than 2.
That's the point of using deep learning here. Of course you can't make up the missing information, but by training the model with a lot of samples, it should eventually reach a point where it produces the most likely original information.
> you can't just CIA-like "ENHANCE" an image to
> double its resolution
I think what you're saying is that if the high-frequency information is gone, it's gone? But that shouldn't matter. We don't need it to be identical to the original. It just needs to sound identical to the original.
If you hit a snare drum 5 times in a row, the high frequency data of each hit will differ wildly, and yet a human won't be able to tell the difference.