by dabinat
1775 days ago
Common Voice is a great project that I’m glad Mozilla kept alive. One problem is that data for speech recognition needs to be extremely accurate (i.e. the speech matches the transcript perfectly), but the human review process is fallible, and quite a number of bad clips have made it past review (to be fair, Mozilla provides no official guidance to reviewers or recorders). Plus, in the early days they were recording the same small sentence pool over and over again, so the first 700 hours or so are duplicates. I hope there will be efforts in the future to clean up the existing dataset and improve its quality.
I'm going to disagree that there's a universal need for perfect training data in ASR. I'm sure it helps with some model types and training processes, but it simply hasn't been a factor in my use of Common Voice (English). I'll also note that my best model can hit around 10% WER on the Common Voice test set without any language model, which is better than any public numbers I've seen posted for it so far. (I'm not even using a separate transformer decoder or RNN decoder layers for this number, just the raw output of CTC greedy decoding.)
None of the above even factors in techniques like wav2vec and IPL (iterative pseudo-labeling) with noisy student, which suggest you can hit extremely competitive accuracy with very little correctly labeled data. These techniques are the underpinnings of the current state-of-the-art models.
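
For anyone unfamiliar with the terms, here is a rough sketch of what "CTC greedy decode" and "WER" mean in practice. It is not the poster's pipeline: the toy vocabulary, the blank index of 0, and the function names `ctc_greedy_decode` and `word_error_rate` are all assumptions made for illustration. Greedy decoding takes the most likely token per frame, collapses consecutive repeats, and drops the blank token; WER is the word-level edit distance divided by the number of reference words.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 is assumed to be the CTC blank.
VOCAB = ["<blank>", " ", "a", "c", "t", "h", "e"]


def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeated ids, then remove blanks.

    `frame_ids` is the per-frame argmax of the model output.
    """
    collapsed = [token for token, _ in groupby(frame_ids)]
    return [token for token in collapsed if token != blank]


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# Per-frame argmax ids for a made-up utterance.
frames = [4, 4, 0, 5, 6, 6, 0, 1, 3, 3, 2, 0, 4, 4]
text = "".join(VOCAB[i] for i in ctc_greedy_decode(frames))
print(text)

# One wrong word out of ten reference words.
ref = "the quick brown fox jumps over the lazy dog today"
hyp = "the quick brown fox jumps over the hazy dog today"
print(word_error_rate(ref, hyp))
```

In this toy setup, a 10% WER means roughly one word in ten is wrong, with no language model or beam search correcting the raw per-frame output.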