by dumbfoundded 2738 days ago
I guess a better way to ask is: which acoustic environments do you excel in?

In terms of gathering data, I'm curious how you plan to get the 15K audio hours it takes to train each of these models. The more you want to segment it (like by acoustic environment or gender), the more data you need. Do you have a cheap way of generating high quality data?

2 comments

I didn't answer "Do you have a cheap way of generating high quality data?". We have good ways to do it. They're not that cheap though. It's expensive (organizationally and real $$$) to label large amounts of data no matter what.

But we do use our own capabilities to better tackle gathering and labeling data in the wild. For instance: is every labeled minute just as valuable as any other? Definitely not. So if you can find and select only the data you want to label, rather than indiscriminately labeling a bunch, then you can increase your overall efficacy.
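To make the "select only the data you want to label" idea concrete, here is a minimal sketch. It is not a description of any particular pipeline: it assumes an existing model has already scored each unlabeled utterance with a confidence value, and it simply spends a fixed labeling budget on the least confident audio first. The `Utterance` type, its `confidence` field, and `select_for_labeling` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    utt_id: str
    duration_s: float   # length of the audio clip in seconds
    confidence: float   # hypothetical: model's mean confidence in [0, 1]


def select_for_labeling(pool: List[Utterance], budget_s: float) -> List[Utterance]:
    """Greedily pick the least confident utterances that fit in the budget.

    Assumption: low model confidence is a useful proxy for "this clip is
    worth paying a human to label". Other selection criteria are possible.
    """
    chosen: List[Utterance] = []
    used_s = 0.0
    # Least confident first, so the budget goes to the hardest audio.
    for utt in sorted(pool, key=lambda u: u.confidence):
        if used_s + utt.duration_s <= budget_s:
            chosen.append(utt)
            used_s += utt.duration_s
    return chosen


pool = [
    Utterance("a", 30.0, 0.95),
    Utterance("b", 40.0, 0.40),
    Utterance("c", 50.0, 0.60),
    Utterance("d", 20.0, 0.80),
]
picked = select_for_labeling(pool, budget_s=100.0)
print([u.utt_id for u in picked])
```

With a 100 second budget, the two least confident clips use 90 seconds and the remaining clips no longer fit, so the high confidence audio is left unlabeled.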

If you're training from scratch, around 10k hours is needed to get a good model, but with transfer learning you don't need nearly that much (100 hours gets you a lot).

We excel in phone call and meeting settings, i.e. the typical sales/office/support environment.

Baidu trained their DeepSpeech model with 6000 hours of English to get a model similarly accurate to Google/Microsoft. It may just be the type of quick model you're using that needs 10k hours to achieve good results.

Mozilla's DeepSpeech is quite interesting: languages like Turkish can get a decently usable (~20% WER) model with just 80 hours of training data (no transfer learning, starting from a clean slate).

Yep, all good points. One thing to consider is that generalization is a big problem. It's easy to get good on a specific dataset nowadays (like 5-10% word error rate level on academic datasets), but that same model might do 40% WER on data in the wild.
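For anyone unfamiliar with the WER numbers being thrown around in this thread: word error rate is the word-level edit distance (substitutions + deletions + insertions) between the transcript and the reference, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and no text normalization (real evaluations usually normalize case and punctuation first); `word_error_rate` is just an illustrative name:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{wer:.1%}")
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, since the denominator is the reference length only.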