	by eggoa 3287 days ago
	I hesitate to even post this, but I listened to the audio examples and it seems like this project was not yet a success. I'm not trying to be a jerk or snarky, but the reconstructed audio sounded terrible.

6 comments

seandougall 3287 days ago

I have to agree. There's certainly more high-frequency content, but it seems mostly like noise, with only a vague amplitude correlation to the existing audio.

I'd be curious to see if any better results could be obtained by applying a similar technique in the frequency domain.

TD-Linux 3287 days ago

Opus does something similar in the frequency domain, but much simpler. It just copies codewords (minus energy) from lower bands to higher bands. It also lets you signal the energy of the copied bands (something you wouldn't have access to if reconstructing blindly).

See "band folding" here: https://people.xiph.org/~xiphmont/demo/celt/demo.html
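To make the idea concrete, here is a toy sketch of the folding step described above. It is not the Opus/CELT implementation: `fold_band`, the band boundaries, and the coefficient values are all hypothetical, and a real codec works on quantized codewords rather than raw floats. It only shows "copy the shape of a lower band, then rescale it to a separately signaled energy".

```python
import numpy as np

def fold_band(coeffs, low, high, target_energy):
    """Fill coeffs[high] with the shape of coeffs[low], scaled to target_energy.

    `low` and `high` are slices of equal length. `target_energy` stands in
    for the band energy a real codec would signal in the bitstream.
    """
    out = coeffs.copy()
    shape = coeffs[low]
    norm = np.linalg.norm(shape)
    if norm > 0:
        # Strip the source band's energy (unit norm), then apply the signaled one.
        out[high] = shape / norm * np.sqrt(target_energy)
    return out

# Hypothetical spectrum: only the band at indices 4..7 has content.
coeffs = np.zeros(16)
coeffs[4:8] = [3.0, -1.0, 2.0, 0.5]

folded = fold_band(coeffs, slice(4, 8), slice(8, 12), target_energy=4.0)
print(folded[8:12])
```

The copied band keeps the relative shape of the source band but carries whatever energy was signaled, which is the part a blind reconstruction would have to guess.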

jhetherly 3287 days ago

hey, author here

Thanks for the feedback.

"applying a similar technique in the frequency domain", "Maybe training an image reconstructor on the short term spectrogram" - This is what I originally thought to do. However, this approach suffers from information loss whenever you transform from the frequency domain back to the time domain. Since the goal was super-resolution in the time domain, working in the time domain is more sensible.

tasty_freeze 3287 days ago

Mathematically, the DFT is invertible, i.e. lossless, but in practice there will be a bit of loss due to the finite precision of floating-point numbers. Even though it isn't strictly lossless, the amount of loss should be minuscule compared to the 16 kHz -> 2 kHz loss you are trying to overcome.
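A quick way to check the size of that round-trip error is a sketch like this one, assuming NumPy and double precision. The test signal (one second of random noise at 16 kHz) and the seed are arbitrary choices, not taken from the project.

```python
import numpy as np

# Hypothetical test signal: 16000 samples of Gaussian noise.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)

# Forward real FFT, then inverse transform back to the time domain.
spectrum = np.fft.rfft(signal)
roundtrip = np.fft.irfft(spectrum, n=signal.size)

# Largest absolute difference between the original and the round trip.
max_error = np.max(np.abs(signal - roundtrip))
print(f"max round-trip error: {max_error:.3e}")
```

In double precision the error should come out many orders of magnitude below anything audible.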

volkuleshov 3287 days ago

The problem with the DFT is not whether it's lossless or not, it's that it may not be the best feature representation for a given task.

Both the DFT and the proposed model apply convolutions to the input, but in the former case, these are fixed, while in the latter, they are learned.

This is similar to how we don't use hard-coded features like SIFT, wavelets, or Gabor filters when we do image classification with a CNN.
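The "fixed" part can be sketched as follows, assuming NumPy: the DFT of a window is just a multiplication by a matrix of predetermined complex exponentials, so each row behaves like a filter whose weights are never trained. The window length and the random input are arbitrary illustration values.

```python
import numpy as np

n = 8  # hypothetical window length
k = np.arange(n)

# Fixed DFT basis: entry (j, m) is exp(-2*pi*i*j*m / n). Nothing here is learned.
dft_matrix = np.exp(-2j * np.pi * np.outer(k, k) / n)

rng = np.random.default_rng(0)
x = rng.standard_normal(n)

manual = dft_matrix @ x    # apply the fixed filters by hand
library = np.fft.fft(x)    # NumPy's FFT computes the same thing

print(np.allclose(manual, library))
```

A learned convolutional layer would replace the rows of `dft_matrix` with weights fitted to the task.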

zxcmx 3287 days ago

It's not precision loss; it's that when you take a DFT you choose an interval. If you choose a short interval you are less certain about frequencies, while if you choose a long interval you are less certain about time-domain changes (i.e., changes in the signal over your time period).

Funnily enough, this is similar to Heisenberg's uncertainty principle. You can read about it here: http://fourier.eng.hmc.edu/e101/lectures/Fourier_Analysis/no...
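The interval trade-off can be put in numbers with a small sketch. The 16 kHz sample rate and the window lengths are assumptions for illustration, and `resolution` is a hypothetical helper: for an N-sample window the DFT bins are spaced sample_rate / N apart, while the window itself spans N / sample_rate seconds.

```python
def resolution(n_samples, sample_rate):
    """Return (window duration in seconds, DFT bin spacing in Hz)."""
    duration = n_samples / sample_rate
    bin_spacing = sample_rate / n_samples
    return duration, bin_spacing

fs = 16000  # assumed sample rate in Hz

for n in (128, 1024, 8192):
    duration, spacing = resolution(n, fs)
    print(f"N={n:5d}: window {duration * 1000:7.1f} ms, bin spacing {spacing:7.2f} Hz")
```

The product of the two quantities is always 1, so sharpening one necessarily blurs the other.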

sigi45 3287 days ago

When using DL, perhaps you could try a downsampling scheme that suits your network?

I mean, yes, it would be awesome to use your network to upsample stuff, but that is apparently hard. What about upsampling from something DL-friendly and treating the reduction of the downsampled size as the challenge?

hcrisp 3287 days ago

Since time domain content is the reconstruction target, wouldn't LSTMs be a better choice than CNNs? I would think the spectral content would be time variant and depend on the sequential history.

murbard2 3287 days ago

I'd be more interested in seeing it applied to reduce MP3 or AAC artifacts.

d--b 3287 days ago

I thought that too. In my opinion the results would be _much_ better by working in frequency space.

Maybe training an image reconstructor on the short term spectrogram is a good start.

highd 3287 days ago

My thinking is this is a good GAN problem. L2 norm will have these bad trivial upscalings as local minima - since L2 in time domain is the same as L2 in frequency domain, you can think in the frequency domain that it basically has this big black area to infill from very little information. If you had some sort of perceptual similarity, on the other hand, there will be lots of adjacent improvements in quality that will reduce the error and make it easier to train. I think this matches the results seen in image upscaling, too.
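The claim that L2 in the time domain equals L2 in the frequency domain is Parseval's theorem, and it can be checked with a short sketch. The signals below are random stand-ins (not audio from the project), and the equality holds exactly only with a unitary DFT normalization, which is what `norm="ortho"` selects in NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target waveform and an imperfect estimate of it.
target = rng.standard_normal(4096)
estimate = target + 0.1 * rng.standard_normal(4096)

error = estimate - target

# Squared L2 error measured directly on the samples.
time_l2 = np.sum(error ** 2)

# Squared L2 error measured on the unitary DFT of the same error signal.
freq_l2 = np.sum(np.abs(np.fft.fft(error, norm="ortho")) ** 2)

print(time_l2, freq_l2)
```

So a model trained with a plain L2 loss on samples is implicitly minimizing the same quantity over the spectrum, including the empty high band.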

d--b 3287 days ago

In fact, when you listen to the downsampled example, there is actually a lot of information in the extract. Way more than enough. That's because frequency should be viewed on a log scale to be relevant to the human ear.

Here the frequency cutoff is 2 kHz, which is already a fairly high pitch.
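A rough back-of-the-envelope sketch of the log-scale argument, counting octaves on each side of the 2 kHz cutoff. The 20 Hz lower limit of hearing and the 8 kHz upper edge (the Nyquist frequency if the original is sampled at 16 kHz) are assumptions, not figures from the post.

```python
import math

def octaves(f_low, f_high):
    """Number of octaves (doublings of frequency) between two frequencies."""
    return math.log2(f_high / f_low)

cutoff_hz = 2000.0        # cutoff mentioned in the thread
hearing_floor_hz = 20.0   # assumed lower limit of hearing
upper_edge_hz = 8000.0    # assumed Nyquist frequency of a 16 kHz original

below = octaves(hearing_floor_hz, cutoff_hz)
above = octaves(cutoff_hz, upper_edge_hz)

print(f"octaves kept below the cutoff: {below:.2f}")
print(f"octaves missing above it:      {above:.2f}")
```

Under those assumptions, most of the octave range survives the downsampling even though three quarters of the linear bandwidth is gone.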

usaphp 3287 days ago

Yeah, but it used "deep learning" so it's going straight to the front page, regardless of the result...

angry_octet 3286 days ago

That is an incredibly useless comment. DL is a new and poorly understood technology; it is obvious that not everything will be perfect.

d--b 3287 days ago

I completely agree. You could even tell from the picture of the spectrum above that the "reconstruction" was not a success. The spectrum looks like it's been reconstructed by a flat extrapolation of the amplitude of the last frequency known.

This is exactly the kind of project where deep models should excel. Something's not working properly here.

jhetherly 3287 days ago

hey, author here

Thanks for the feedback.

"Something's not working properly here" - I disagree. The model will overtrain (i.e. perfectly reconstruct the original waveforms of a small training set), which indicates it's capable of learning the necessary transformation. The problem lies in the limited amount of training time I had. To reiterate from an earlier comment, I trained for only 10 epochs, while the paper this is based on claimed to train for 400. Much more training is required for this model to generalize well without degrading the signal-to-noise ratio.

d--b 3286 days ago

Hey thanks for the comment. That makes sense, I'd be interested in hearing the results with more training. This could work well and have a good range of applications.

jhetherly 3287 days ago

hey, author here

Thanks for the feedback.

"the reconstructed audio sounded terrible" - I think this is referring to the amount of static noise in the reconstructed waveform. Indeed, the SNR clearly shows the reconstruction is slightly worse than the downsampled waveform. As mentioned in the post, I strongly believe this is due to the limited amount of training I performed. I trained for only 10 epochs, while the paper this project is based on trained for 400 epochs. During training I noticed that perceptual performance depended strongly on the number of training epochs.
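For readers who want to reproduce this kind of comparison, here is a minimal sketch of one common SNR definition, assuming NumPy. The exact formula used in the post is not stated in this thread, so `snr_db` and the sine-wave test signals are illustrative assumptions only.

```python
import numpy as np

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB, treating (reference - estimate) as noise."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_power = np.sum(reference ** 2)
    noise_power = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10(signal_power / noise_power)

# Hypothetical one-second signals at an assumed 16 kHz sample rate.
t = np.arange(16000) / 16000.0
reference = np.sin(2 * np.pi * 440.0 * t)
# The estimate adds an unwanted tone at one tenth of the reference amplitude.
estimate = reference + 0.1 * np.sin(2 * np.pi * 3000.0 * t)

print(f"SNR: {snr_db(reference, estimate):.2f} dB")
```

Computing this for both the downsampled and the reconstructed waveform against the original gives the two numbers being compared. The helper does not guard against a zero-noise input.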

vortico 3286 days ago

My ears think so too, but upsampling by just 2 is roughly the same difficulty as upsampling an image by 2. As you probably know, you can't just CIA-like "ENHANCE" an image to double its resolution and expect its noise level to be lower than, I don't know, 10 decibels (of image brightness). Yet our ears can notice noise as low as 40-50 decibels, so it would be nearly impossible to reconstruct higher frequencies so that the result has no noticeable noise.

In this research, the author is attempting to upsample by a bit more than 2.

d--b 3286 days ago

That's the point of using deep learning here. Of course you can't make up the missing information, but by training the model with a lot of samples, it should eventually reach a point where it produces the most likely original information.

It works quite well on images: https://github.com/alexjc/neural-enhance

charlesism 3286 days ago

    > you can't just CIA-like "ENHANCE" an image to double its resolution

I think what you're saying is that if the high-frequency information is gone, it's gone? But that shouldn't matter. We don't need it to be identical to the original. It just needs to sound identical to the original.

If you hit a snare drum 5 times in a row, the high frequency data of each hit will differ wildly, and yet a human won't be able to tell the difference.
