by gok 4198 days ago
So with 300 hours of training data it does worse on SWB than a DNN-HMM, or even a GMM-HMM system? But when they give it 2300 hours of training data, it can beat those 300-hour trained systems?

This is still very cool, but that comparison doesn't seem fair at all.


Why not? DNN-HMM and GMM-HMM wouldn't have done any better even if trained on 2300 hours.
Mostly this, though it's not so black-and-white. The paper discusses results from a DNN-HMM system (Maas et al., using Kaldi) trained on 2k hours, and it does provide a small generalization improvement over 300 hours.

Much of the excitement about deep learning -- which we see as well in DeepSpeech -- is that these models continue to improve as we provide more training data. It's not obvious a priori that results will keep getting better after thousands of hours of speech. We're excited to keep advancing that frontier.

That was an even weirder comparison. They compare a system trained on 2000 hours of acoustic data mismatched with the testing data to their system, which was trained on 300 hours of matched data in addition to the 2000 hours of mismatched acoustic data.
Are any of these systems open source?
Both Kaldi[1] and CMU Sphinx[2] are high-quality open source speech systems. I know for a fact that Kaldi includes support for DNN acoustic models (I'm less familiar with Sphinx).

[1] http://kaldi.sourceforge.net/
[2] http://cmusphinx.sourceforge.net/

Thanks, appreciated, but my dear lord, without a PhD in AI systems these things are a bit beyond what most users, me included, would casually play around with. It would be great if this tech made it into a Dragon NaturallySpeaking-like end product for private use.