by volkuleshov 3287 days ago
I'm one of the authors of the paper that proposes the deep learning model implemented in the blog post, and I would recommend training on a different dataset, such as VCTK (freely available, and what we used in our paper).

Super-resolution methods are very sensitive to the choice of training data. They will overfit seemingly insignificant properties of the training set, such as the type of low-pass filter you are using, or the acoustic conditions under which the recordings were made (e.g. distance to the microphone when recording a speaker).
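
To make the low-pass filter point concrete, here is a rough sketch (not the preprocessing from our paper or from the blog post) that builds the low-resolution input from the same signal with two different anti-aliasing filters. The test signal, sample rate, filter lengths, cutoff, and every function name are assumptions picked for illustration. It only uses the Python standard library.

```python
import math

def windowed_sinc_taps(cutoff, num_taps):
    """Hann-windowed sinc low-pass FIR.

    cutoff is a fraction of the sampling rate (0 < cutoff < 0.5);
    num_taps should be odd so the filter is symmetric around its centre.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        ideal = 2 * cutoff if t == 0 else math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [h / total for h in taps]  # normalise to unity gain at DC

def moving_average_taps(num_taps):
    """Crude low-pass filter: a plain moving average."""
    return [1.0 / num_taps] * num_taps

def filter_same(signal, taps):
    """Zero-padded convolution, centred so the output has the input's length."""
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(signal):
                acc += h * signal[j]
        out.append(acc)
    return out

def make_low_res(signal, factor, taps):
    """Low-pass filter, then keep every factor-th sample."""
    return filter_same(signal, taps)[::factor]

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

# Hypothetical "high-resolution" clip: two tones sampled at 16 kHz.
sample_rate = 16000
high_res = [
    math.sin(2 * math.pi * 440 * n / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 3000 * n / sample_rate)
    for n in range(2048)
]

factor = 4  # 16 kHz -> 4 kHz
low_res_sinc = make_low_res(high_res, factor, windowed_sinc_taps(0.1, 31))
low_res_avg = make_low_res(high_res, factor, moving_average_taps(5))

# Same source audio, same factor, but the network would see different inputs.
mismatch = rms([a - b for a, b in zip(low_res_sinc, low_res_avg)])
print(f"RMS difference between the two low-res inputs: {mismatch:.4f}")
```

A model trained only on inputs made with one of these filters can pick up on that filter's particular attenuation and aliasing pattern, so it is worth using the same filter (or a mix of filters) at training and test time.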

To capture all the variation present in the TED talks dataset, you would need a very large model, and you would probably have to train it for more than 10 epochs. The VCTK dataset is better in this regard.

For comparison, here are our samples: kuleshov.github.io/audio-super-res/

I'm going to try to release the code over the weekend.

1 comment

Thanks for commenting and the suggestion!

Indeed, the TED dataset has a lot of variability in audio quality and recording conditions, which, as you mentioned, is difficult to capture with just 10 epochs of training. I did try a larger network (up to 11 downsampling layers), but, as expected, it was even more time-consuming to train. So I split the difference and went with a network similar to yours that could be trained in four days (10 epochs).