by dabinat 1775 days ago
Common Voice is a great project that I’m glad Mozilla kept alive.

One problem is that data for speech recognition needs to be extremely accurate (i.e. the speech matches the transcript perfectly), but the human review process is fallible, and quite a number of bad clips have made it past review (to be fair, Mozilla provides no official guidance to reviewers or recorders).

Plus in the early days, they were recording the same small sentence pool over and over again, so the first 700 hours or so are duplicates.

I hope there will be efforts in the future to clean up the existing dataset to improve its quality.

3 comments

I'm an ASR researcher shipping high quality English models trained on limited resources, and while I've needed to include other datasets to make the model more robust to different kinds of text, Common Voice is a substantial part of my training process. I did not do any manual transcript accuracy cleanup. Most of my automated cleanup was done with very basic (low quality) models. My latest models trained this way are competitive with e.g. Google or Apple English speech recognition accuracy.

I'm going to disagree that there's a universal need for perfect training data in ASR. I'm sure it helps with some model types and training processes, but it simply hasn't been a factor in my use of Common Voice (English). I'll also note my best model can hit around 10% WER on Common Voice Test without any language model, which is better than any public numbers I've seen posted for it so far (I'm not even using a separate transformer decoder or RNN decoder layers for this number, just the raw output of CTC greedy decode).
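
For readers unfamiliar with the terms above, here is a minimal sketch of CTC greedy decoding and word error rate (WER). It is not the commenter's code: the function names, the choice of blank id 0, and the whitespace tokenization are all assumptions made for illustration.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse a per-frame argmax sequence: merge repeats, then drop blanks.

    `frame_ids` is assumed to be the argmax token id for each audio frame;
    `blank=0` is an assumed convention, not something fixed by CTC itself.
    """
    out = []
    prev = None
    for token in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token != prev and token != blank:
            out.append(token)
        prev = token
    return out


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words, keeping one DP row at a time.
    prev_row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        row = [i]
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev_row[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev_row[j] + 1
            insertion = row[j - 1] + 1
            row.append(min(substitution, deletion, insertion))
        prev_row = row
    # Guard against an empty reference to avoid dividing by zero.
    return prev_row[-1] / max(len(ref), 1)
```

Under these assumptions, a 10% WER means roughly one word error (substitution, insertion, or deletion) per ten reference words.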

None of the above even factors in techniques like wav2vec and IPL (iterative pseudo labeling) with noisy student, which suggest you can hit extremely competitive accuracy with very little correctly labeled data. These techniques are the underpinnings of the current state of the art models.

Here are some draft guidelines for validation that have been widely translated: https://discourse.mozilla.org/t/discussion-of-new-guidelines...

But you are right, the process has some flaws. Maybe we can automatically check the dataset for some common errors, once an STT system is ready for a language?
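
A rough sketch of what such an automatic check could look like: run each clip through an STT system and flag clips whose output is far from the stored transcript. The `transcribe` callable, the `(clip_id, audio_path, transcript)` tuple layout, the English-only normalization, and the 0.8 threshold are all hypothetical choices for illustration, not part of Common Voice.

```python
import difflib
import re


def normalize(text):
    """Lowercase and strip punctuation so formatting differences are ignored.

    Assumption: English-only text; other languages need a different rule.
    """
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def flag_suspect_clips(clips, transcribe, min_similarity=0.8):
    """Return (clip_id, similarity) for clips whose STT output disagrees
    with the stored transcript.

    `clips` is an iterable of (clip_id, audio_path, transcript) tuples.
    `transcribe` is a hypothetical callable: audio_path -> recognized text.
    """
    suspects = []
    for clip_id, audio_path, transcript in clips:
        ref_words = normalize(transcript)
        hyp_words = normalize(transcribe(audio_path))
        # Word-level similarity in [0, 1]; 1.0 means identical word sequences.
        score = difflib.SequenceMatcher(None, ref_words, hyp_words).ratio()
        if score < min_similarity:
            suspects.append((clip_id, score))
    return suspects
```

Flagged clips would still need a human look, since a low score can mean a bad recording, a bad transcript, or simply an STT mistake.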

The only other option I can think of is a validation process that includes more people per sentence. Right now, only two people validate a sentence, and if they disagree, a third person decides. We could at least double-check sentences with one "no" vote one more time.
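
To make the rule concrete, here is a small sketch of the voting logic as described above, plus the proposed extra check. The function names and the list-of-booleans representation are assumptions; this is one reading of the description, not the actual Common Voice implementation.

```python
def clip_status(votes):
    """Decide a clip's status from reviewer votes (True = "yes", False = "no").

    Reading of the rule described above: two matching votes settle it,
    and a third vote breaks a one-to-one tie.
    """
    yes = sum(1 for v in votes if v)
    no = len(votes) - yes
    if yes >= 2:
        return "valid"
    if no >= 2:
        return "invalid"
    return "pending"


def needs_recheck(votes):
    """Proposed extra step: re-review clips accepted despite a "no" vote."""
    return clip_status(votes) == "valid" and any(not v for v in votes)
```

For example, a clip with votes yes, no, yes would be accepted today but queued for one more review under the proposal.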

The community guidelines are good but they’re hidden away on the forum. I was asking them for years to just make those the official guidelines and link them prominently on the CV site but they never did.

However, Hillary, the new community manager, seems good and she’s making a lot of positive changes so hopefully this will be addressed soon.

Long-term the best approach may be some kind of user onboarding before they can record / validate.

Hey,

Thank you for the compliment and feedback.

Following community feedback, voice validation criteria are now available on the Common Voice platform (released as part of the recent dataset).

This is one of many steps we are taking to improve Common Voice for contributors and everyone using the dataset.

Why does data for speech recognition need to be perfect? That's certainly not the case for other machine learning applications. Can you train on the less clean data and fine-tune on a clean subset?
Well that was kind of my point: you need to manually figure out what’s clean and what isn’t.
But it's easy to do that for a small subset for finetuning compared to cleaning up the entire dataset.