by jononor
2397 days ago
As an introduction I guess this is OK. However, there are two major limitations:

1: The feature extraction ends by mean-summarizing across the entire audio clip, leaving no temporal information. This only works well for simple tasks. It would be good to at least mention analysis windows and temporal modelling as the natural next step, be it an LSTM/GRU on the MFCCs or a CNN on the mel-spectrogram.

2: The folds of the UrbanSound8K dataset are not respected in the evaluation. In UrbanSound8K, each fold groups clips extracted from the same original audio files, usually very close in time. So shuffling clips across folds to build the test set means it is no longer entirely "unseen data". The model very likely exploits this data leakage, as the reported accuracy is above the state of the art (without data augmentation), which is implausible given the low-fidelity feature representation.
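To make the fold issue concrete, here is a minimal sketch of a fold-respecting evaluation loop. It assumes the fold label of every clip has already been read from the dataset metadata into a plain list; the `leave_one_fold_out` helper and the `toy_folds` values are hypothetical, made up for illustration. Each predefined fold is held out once as the test set, and clips are never reshuffled across folds:

```python
from collections import defaultdict


def leave_one_fold_out(fold_ids):
    """Yield (test_fold, train_idx, test_idx) using predefined fold labels.

    Clips keep the fold they were assigned in the metadata, so clips that
    share a fold are never split between the training and the test set.
    """
    by_fold = defaultdict(list)
    for idx, fold in enumerate(fold_ids):
        by_fold[fold].append(idx)

    for test_fold in sorted(by_fold):
        test_idx = by_fold[test_fold]
        # Training set: every clip from every other fold.
        train_idx = [
            i
            for fold in sorted(by_fold)
            if fold != test_fold
            for i in by_fold[fold]
        ]
        yield test_fold, train_idx, test_idx


# Toy stand-in for the per-clip fold column (hypothetical values).
toy_folds = [1, 1, 2, 3, 2, 3, 1]
splits = list(leave_one_fold_out(toy_folds))

for test_fold, train_idx, test_idx in splits:
    print("test fold", test_fold, "train", train_idx, "test", test_idx)
```

With the real dataset you would train and score once per split and report the mean accuracy across folds, instead of a single score on a random train/test split.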
At least mentioning this limitation, and that the performance number they give cannot be compared with other methods, would be prudent. When I commented similarly on r/machinelearning, the authors acknowledged these weaknesses but did not update the article to reflect them.
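On point 1, a small sketch of why a clip-wide mean discards temporal information. The 2-dimensional frames below are hypothetical stand-ins for per-frame MFCC vectors, and `mean_vector` and `windowed_means` are illustrative helpers, not anything from the article. Two clips with the same frames in opposite order get identical clip-wide means, while per-window means keep the ordering that a sequence model could use:

```python
def mean_vector(frames):
    """Average a list of equal-length feature frames into one vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def windowed_means(frames, win, hop):
    """Mean-summarize each analysis window, keeping the window order."""
    if win <= 0 or hop <= 0:
        raise ValueError("win and hop must be positive")
    return [
        mean_vector(frames[start:start + win])
        for start in range(0, len(frames) - win + 1, hop)
    ]


# Toy clips: same frames, opposite temporal order (hypothetical values).
rising = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
falling = list(reversed(rising))

# Clip-wide mean: the two clips become indistinguishable.
print(mean_vector(rising), mean_vector(falling))

# Windowed summaries: the order of events is still visible.
print(windowed_means(rising, win=2, hop=2))
print(windowed_means(falling, win=2, hop=2))
```

The windowed output is a short sequence per clip, which is the kind of input an LSTM/GRU would consume; a CNN on the mel-spectrogram skips the summarizing step entirely and works on the time-frequency image.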