by el5r 2876 days ago
Looks like Kaldi gets 4.27% WER on CV: https://github.com/kaldi-asr/kaldi/blob/master/egs/commonvoi...

But that's likely trained exclusively on CV audio, with a language model derived from the CV training data, so that's also not a fair comparison unless the other engines were trained the same way.

Comparing systems trained on different datasets (with different language models) is comparing apples to oranges. Mixing wildly different CPU and memory requirements into the benchmarks just makes it worse.

It should be fairly straightforward to do an unbiased comparison by training and evaluating with the standard Librispeech split and language model. It might be interesting to see how accuracy improves as the models scale up until they match the resource requirements of the other engines.
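For what it's worth, here is a minimal sketch of the scoring side of such a comparison: every engine is scored against the same reference transcripts with the same text normalization. The function names (`edit_distance`, `corpus_wer`) are made up for illustration and are not part of Kaldi, DeepSpeech, or Cheetah. Lowercasing and whitespace tokenization are assumptions; a real evaluation would use whatever normalization the benchmark prescribes.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Corpus-level WER: total word errors divided by total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must be the same length")
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        # Assumed normalization: lowercase, split on whitespace.
        ref_words = ref.lower().split()
        hyp_words = hyp.lower().split()
        total_errors += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("reference transcripts contain no words")
    return total_errors / total_words
```

Running `corpus_wer` once per engine over the identical test split gives numbers that are at least comparable on the scoring side. It does not remove the differences in training data or language model discussed above.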

That said, the speed and memory usage are impressive and I like the focus on very low resource environments. Seems like it has a lot of potential even if it may not be SOTA.

1 comment

Thanks for the link. I'll be sure to look into it. It is impressive.

One disclaimer: we use the valid train portion of CV as part of our training set, but it is less than 10% of the training set (in terms of hours). Also, we do not employ an LM, mainly because the systems we are targeting do not have enough storage for a strong LM (storage on them usually maxes out at 64 MB). Cheetah is an end-to-end acoustic model. In later versions, we might be able to add a well-pruned LM for specific limited-vocabulary domains to boost accuracy within the limited storage available.

I fully agree with your points. I am taking notes here, as I think we should follow up on a couple of your suggestions. Scaling up to DeepSpeech model size can be a bit tricky, as it would require much more compute (GPU), but it should be quite doable with time and budget.

Thanks again for your comments and suggestions. As you correctly pointed out, our main focus is very low resource (CPU/memory) embedded systems.