by harisankarh
2600 days ago
The authors propose an unsupervised encoder for ASR. The encoder is trained on an interesting upstream task: predicting whether or not a given audio clip follows the current one. The authors report superior overall accuracy, even surpassing the massively trained Deepspeech 2 model on certain datasets. They also perform insightful characterization and ablation studies.

The approach seems to provide a significant accuracy boost when the available supervised training set is small, e.g., less than 10 hours. The relative improvement over the baseline supervised model is modest when that model is trained on tens of hours of transcribed audio. The trends indicate that the improvement is probably minimal when hundreds of hours of supervised training data are available.

The reported improvements over Deepspeech need a closer look. Deepspeech uses a 5-gram language model. When the proposed model also uses an n-gram-based language model, its performance is significantly lower (albeit with a smaller supervised training set). Improvements over Deepspeech are shown only when convolutional language models are used. Hence, it is possible that the improvements over Deepspeech come mainly from the convolutional language models. A comparison against Deepspeech with a convolutional language model would give a better apples-to-apples evaluation of the proposed unsupervised pre-trained acoustic model.

The gains also seem to show diminishing returns as the number of hours of unsupervised training data increases: the improvement is marginal even with a 10x increase in unsupervised training data.