HN Mirror


by punchingwater 3261 days ago

I can tell from your comment (and its responses) that the language on our homepage is a bit confusing, so thank you for the feedback.

To answer your question: Common Voice is about building a collection of labelled voice data (i.e. sentence clips with transcripts) that can be used to, for instance, train speech-to-text algorithms. Part of the goal of this project, though, is to figure out how this data can best help people build voice technology. So it's pretty open-ended at this point.

Mozilla does have an open source speech-to-text engine, DeepSpeech [1], that we are developing, and we hope one day to use the Common Voice data to train this engine. DeepSpeech and Common Voice are related but separate projects, if that makes sense.

As for LibriSpeech, the DeepSpeech team at Mozilla does use this data for training. However, the language is pretty antiquated, and we only get about 1K hours of data, whereas you need about 10K hours to get to a decent accuracy (a word error rate, WER, of 10% or below). Common Voice is about adding to public corpora like LibriSpeech, not replacing them.
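For readers unfamiliar with the metric, WER is usually defined as the word-level edit distance (substitutions + deletions + insertions) between the recognizer's output and the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that standard definition, not code from DeepSpeech or Common Voice; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace only,
    with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, one wrong word in a ten-word reference gives a WER of 10%, the threshold mentioned above. Note that WER can exceed 100% when the hypothesis contains many inserted words.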

[1] https://github.com/mozilla/DeepSpeech

1 comments

albertzeyer 3261 days ago

Very interesting. I was not aware that there is Mozilla DeepSpeech (which implements, in TensorFlow, the model from the Baidu paper of the same name, DeepSpeech). Note that the issue with DeepSpeech (the CTC model from the Baidu paper) is that it really needs a lot of training data to perform well (that is a generic property of CTC). If you use more conventional models (hybrid NN/HMM models), you can get very decent word-error-rate performance with only a few hundred hours of data. The advantage of DeepSpeech, of course, is that it is simpler and you don't need a lexicon (a mapping from words to their pronunciations, i.e. sequences of phonemes).
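To illustrate why a character-level CTC model can do without a lexicon: the network emits one label per audio frame (a character or a special blank), and the transcript is obtained by merging consecutive repeats and then dropping blanks. The sketch below shows only this standard collapsing rule applied to an already-chosen label sequence; it is not DeepSpeech's decoder, and the name `ctc_collapse` and the `_` blank symbol are assumptions for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real systems use a reserved index


def ctc_collapse(frame_labels):
    """Apply the CTC collapsing rule to a per-frame label sequence.

    Step 1: merge runs of identical consecutive labels into one label.
    Step 2: remove the blank symbol.
    """
    merged = [label for label, _run in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)
```

For example, `ctc_collapse(list("hh_e_ll_lo_"))` yields `hello`: the blank between the two `l` runs is what keeps the double letter from being merged into one.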

I would also not use voice technology as the generic term for speech recognition, text-to-speech, and whatever else you want to do with this data. Rather, speech technology is the common term to cover all of this (https://en.wikipedia.org/wiki/Speech_technology).

punchingwater 3261 days ago

Noted. Again, thanks for the feedback :)