by albertzeyer 3261 days ago
Very interesting. I was not aware that there is Mozilla DeepSpeech (which implements, in TensorFlow, the model from the Baidu paper of the same name, DeepSpeech). Note that the issue with DeepSpeech (the CTC model from the Baidu paper) is that it really needs a lot of training data to perform well (that is a generic property of CTC). If you use more conventional models (hybrid NN/HMM models), you can get very decent word-error-rate performance with only a few hundred hours of data. The advantage of DeepSpeech, of course, is that it is simpler and you don't need a lexicon (a mapping from words to their pronunciations, i.e. sequences of phonemes).

I would also not use voice technology as the generic term for speech recognition, text-to-speech, and whatever else you want to do with this data. Rather, speech technology is the common term to cover all of this (https://en.wikipedia.org/wiki/Speech_technology).
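
For anyone not familiar with the word error rate mentioned above: it is usually defined as the word-level edit distance (substitutions + deletions + insertions) between the recognizer output and the reference transcript, divided by the number of reference words. A minimal sketch in Python, assuming plain whitespace tokenization and no text normalization (the function name `word_error_rate` is just for illustration, not taken from DeepSpeech or any toolkit):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Real evaluations typically also normalize case and punctuation before scoring, which this sketch leaves out.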

1 comment

Noted. Again thanks for the feedback :)