by lunixbochs 1775 days ago
I'm an ASR researcher shipping high-quality English models trained on limited resources. I've needed to include other datasets to make the model more robust to different kinds of text, but Common Voice is a substantial part of my training process. I did no manual cleanup of transcript accuracy, and most of my automated cleanup was done with very basic (low-quality) models. My latest models trained this way are competitive with e.g. Google or Apple English speech recognition accuracy.

I'm going to disagree that there's a universal need for perfect training data in ASR. I'm sure it helps with some model types and training processes, but it simply hasn't been a factor in my use of Common Voice (English). I'll also note my best model can hit around 10% WER on Common Voice Test without any language model, which is better than any public numbers I've seen posted for it so far. That number is just the raw output of CTC greedy decode; I'm not even using a separate transformer decoder or RNN decoder layers.
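
For anyone unfamiliar with the terms, here is a minimal sketch of what "CTC greedy decode" and "WER" mean. This is not my actual code: the function names, the blank index of 0, and the toy inputs are all assumptions made for illustration. Greedy decode takes the highest-scoring token per frame, collapses consecutive repeats, then drops blanks. WER is word-level edit distance divided by the number of reference words.

```python
from itertools import groupby


def argmax_per_frame(frame_scores):
    """Pick the index of the highest score in each frame (hypothetical helper)."""
    return [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]


def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeated ids, then remove the blank id.

    blank=0 is an assumption; real models may place the blank elsewhere.
    """
    collapsed = [token for token, _ in groupby(frame_ids)]
    return [token for token in collapsed if token != blank]


def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# Toy example: 3 frames over a 3-symbol vocabulary (index 0 is the blank).
toy_scores = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.9, 0.05, 0.05]]
toy_ids = ctc_greedy_decode(argmax_per_frame(toy_scores))
```

Note that no language model or beam search is involved anywhere in this path, which is the point of quoting a greedy number.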

None of the above even factors in techniques like wav2vec and IPL (iterative pseudo-labeling) with noisy student, which suggest you can hit extremely competitive accuracy with very little correctly labeled data. These techniques are the underpinnings of the current state-of-the-art models.