| HN Mirror

Thanks for the link. I'll be sure to look into it. It is impressive.

A disclaimer is that we use the valid train portion of CV as part of our training set. But it is less than 10% of the train set (in terms of hours). Also, we do not employ an LM mainly because the systems we are targeting do not have enough storage for a strong LM (usually the storage on them maxes out at 64 MB). Cheetah is an end-to-end acoustic model. For later versions, we might be able to add a well-pruned LM for specific domains with limited vocabulary to boost the accuracy with limited storage available.

I fully agree with your points. I am taking notes here as I think we should follow up on a couple of your suggestions. Scaling up to DeepSpeech model size can be a bit tricky as it would require much more compute resources (GPU). But should be quite doable with time and budget.

Thanks again for your comments and suggestions. As you correctly pointed out our main focus is the very low resource (CPU/Memory) embedded systems.