by volkuleshov
3287 days ago
I'm one of the authors of the paper that proposes the deep learning model implemented in the blog post, and I would recommend training on a different dataset, such as VCTK (freely available, and what we used in our paper). Super-resolution methods are very sensitive to the choice of training data. They will overfit to seemingly insignificant properties of the training set, such as the type of low-pass filter you are using, or the acoustic conditions under which the recordings were made (e.g. the speaker's distance from the microphone). To capture all the variation present in the TED talks dataset, you would need a very large model, and you would probably need to train it for more than 10 epochs. The VCTK dataset is better in this regard. For comparison, here are our samples: kuleshov.github.io/audio-super-res/

I'm going to try to release the code over the weekend.
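To make the low-pass filter point concrete, here is a minimal sketch. It is not the preprocessing from the paper or the blog post. The function names, the 31-tap filter length, the 4x factor, and the two test tones are all illustrative assumptions. It downsamples the same high-resolution signal with two different anti-aliasing filters and shows that the resulting low-resolution inputs differ, which is the kind of detail a model can latch onto.

```python
import math

def windowed_sinc_lowpass(num_taps, cutoff, window="hann"):
    """FIR low-pass taps; cutoff is a fraction of the sampling rate (0 < cutoff < 0.5)."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response (a sinc), with the x == 0 case handled separately.
        h = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        if window == "hann":
            h *= 0.5 - 0.5 * math.cos(2 * math.pi * n / (num_taps - 1))
        # window == "rect" leaves the truncated sinc unchanged.
        taps.append(h)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unit gain at DC

def downsample(signal, factor, taps):
    """Low-pass filter (zero-padded at the edges), then keep every `factor`-th sample."""
    half = len(taps) // 2
    out = []
    for i in range(0, len(signal), factor):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += t * signal[j]
        out.append(acc)
    return out

# Hypothetical "high-resolution" signal: two tones sampled at 16 kHz.
sr = 16000
high_res = [
    math.sin(2 * math.pi * 440 * n / sr) + 0.5 * math.sin(2 * math.pi * 3500 * n / sr)
    for n in range(1024)
]

factor = 4
cutoff = 0.5 / factor  # Nyquist of the downsampled signal, as a fraction of sr

# Same source audio, two different low-pass filters.
low_hann = downsample(high_res, factor, windowed_sinc_lowpass(31, cutoff, "hann"))
low_rect = downsample(high_res, factor, windowed_sinc_lowpass(31, cutoff, "rect"))

max_diff = max(abs(a - b) for a, b in zip(low_hann, low_rect))
print(f"{len(low_hann)} low-res samples, max difference between filters: {max_diff:.4f}")
```

A model trained only on pairs produced with one of these filters sees a consistent signature in its inputs, so it may not generalize to inputs produced with the other.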
Indeed, the TED dataset has a lot of variability in audio quality and the like, which, as you mentioned, is difficult to capture with just 10 epochs of training. I did try a larger network (up to 11 downsampling layers), but, as expected, it was even more time-consuming to train. So I split the difference and went with a network similar to yours, but one that could be trained over a four-day period (10 epochs).