by jononor 2397 days ago
As an introduction I guess this is OK. However, there are two major limitations:

1: The feature extraction ends with mean-summarizing across the entire audio clip, leaving no temporal information. This only works well for simple tasks. At least mentioning analysis windows and temporal modelling as the natural next step would be good, be it LSTM/GRU on the MFCCs or a CNN on the mel-spectrogram.
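To illustrate the first point, here is a minimal sketch in plain Python. It assumes the features arrive as a list of per-frame vectors (for example MFCC frames); the function names `mean_summarize` and `analysis_windows` and the toy data are made up for illustration and are not taken from the article.

```python
from statistics import fmean

def mean_summarize(frames):
    """Collapse an (n_frames x n_coeffs) feature matrix into one vector
    by averaging each coefficient over time. Frame order is lost."""
    n_coeffs = len(frames[0])
    return [fmean(frame[i] for frame in frames) for i in range(n_coeffs)]

def analysis_windows(frames, window_len, hop):
    """Slice the frame sequence into fixed-length windows.
    Frame order inside each window is preserved."""
    return [frames[start:start + window_len]
            for start in range(0, len(frames) - window_len + 1, hop)]

# Two toy "clips" with one coefficient per frame: one rising, one falling.
rising = [[0.0], [1.0], [2.0], [3.0]]
falling = [[3.0], [2.0], [1.0], [0.0]]

# The time-averaged summaries are identical, so a classifier cannot tell them apart.
same_summary = mean_summarize(rising) == mean_summarize(falling)

# The windowed representations differ, so a temporal model still can.
windows_differ = analysis_windows(rising, 2, 2) != analysis_windows(falling, 2, 2)
```

The windows would then be fed to a sequence model or CNN instead of a single averaged vector.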

2: The folds of the Urbansound8k dataset are not respected in the evaluation. In Urbansound8k, clips extracted from the same original audio file, usually very close in time, are kept together in the same fold. So mixing the folds for the test set means it is no longer entirely "unseen data". The model very likely exploits this data leakage, as the reported accuracy is above SOTA (for no data augmentation), which is unreasonable given the low-fidelity feature representation. At least mentioning this limitation, and that the reported performance number cannot be compared with other methods, would be prudent.

When I commented similarly on r/machinelearning the authors acknowledged these weaknesses, but did not update the article to reflect them.

we're working on another version fixing the folds issue on Urbansound8k and will update the article asap.
Nice!
just to clarify - are you referring to this experiment? https://www.comet.ml/demo/urbansound8k/be09e32700cd435fb6b55...
Sure, that demonstrates the issue. The problem is with using train_test_split(X, yy, test_size=0.2..), which assumes independent samples. That assumption is violated for this dataset, because some clips come from the same source audio files. The easiest (and completely acceptable) fix is to use one fold as the validation data, one fold as the test set, and the remaining folds as training data.
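A rough sketch of such a fold-based split, in plain Python. It assumes each clip is a dict carrying the fold number from the dataset metadata under a `"fold"` key; the helper name `split_by_fold`, the `"file"` key and the toy clips are hypothetical.

```python
def split_by_fold(clips, val_fold, test_fold):
    """Split clips into train/validation/test using the predefined folds,
    so that clips from one fold never end up in two different sets."""
    if val_fold == test_fold:
        raise ValueError("validation and test fold must differ")
    train, val, test = [], [], []
    for clip in clips:
        if clip["fold"] == test_fold:
            test.append(clip)
        elif clip["fold"] == val_fold:
            val.append(clip)
        else:
            train.append(clip)
    return train, val, test

# Toy metadata: 30 clips spread evenly over folds 1..10 (3 clips per fold).
clips = [{"file": f"clip{i}.wav", "fold": (i % 10) + 1} for i in range(30)]

train, val, test = split_by_fold(clips, val_fold=9, test_fold=10)
```

Unlike a random split, the fold membership alone decides where a clip goes, so clips cut from the same recording stay on the same side.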

This problem is unfortunately quite common even in academic papers using this dataset, even though the authors warn about it.

EDIT: There is one more issue with Urbansound8k folds, and that is that the difficulty of the various folds is quite different. So one should ideally report the performance across all folds (mean/std or boxplot). But this is a minor issue compared to data leakage.
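Reporting across all folds could look roughly like this. It is a sketch only: `cross_validate_by_fold`, the `majority_baseline` stand-in for a real model and the toy clips are all invented for illustration, and the fold numbers 1 to 10 are assumed to be stored under a `"fold"` key.

```python
from collections import Counter
from statistics import mean, stdev

def cross_validate_by_fold(clips, evaluate, folds=range(1, 11)):
    """Hold out each fold in turn as the test set and collect one score per fold."""
    scores = []
    for test_fold in folds:
        train = [c for c in clips if c["fold"] != test_fold]
        test = [c for c in clips if c["fold"] == test_fold]
        scores.append(evaluate(train, test))
    return scores

def majority_baseline(train, test):
    """Placeholder 'model': always predict the most common training label,
    and return the accuracy on the held-out fold."""
    majority = Counter(c["label"] for c in train).most_common(1)[0][0]
    return sum(c["label"] == majority for c in test) / len(test)

# Toy data: 4 clips per fold, with a label mix that varies between folds.
clips = [
    {"fold": fold, "label": "siren" if i <= fold % 3 else "drilling"}
    for fold in range(1, 11)
    for i in range(4)
]

scores = cross_validate_by_fold(clips, majority_baseline)
mean_acc, std_acc = mean(scores), stdev(scores)
```

The per-fold `scores` list is also what you would hand to a boxplot.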

PS: Nice use of the Comet.ml platform, this: collaborating online on improving the experimental setup :)

Hey jononor — we've updated the post to split the training and test sets based on the folds. Good catch and thanks again for reporting this. Some of the experiments in the project will still have the old code, but the blog post will reflect this new train/test split.
Nice. Did you also update the reported results? I think they will change quite a bit.