by stephensonsco 2739 days ago
It's a metric that's hard to nail down because there is so much parameter space that you are flattening into one number. Also, it doesn't address the "I care about these five high-value words (that are made up), can you recognize them?" case, like product names and company names.
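To make the "one number" point concrete, here's a rough sketch of how WER is usually computed (word-level edit distance divided by reference length), next to a simple keyword recall check. The function names, the lowercase/whitespace normalization, and the example sentence are my own assumptions for illustration, not anyone's production scoring code.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def keyword_recall(reference: str, hypothesis: str, keywords) -> float:
    """Fraction of keyword occurrences in the reference that also show up
    in the hypothesis (bag-of-words, ignores position)."""
    wanted = {k.lower() for k in keywords}
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    total = sum(ref_counts[k] for k in wanted)
    if total == 0:
        raise ValueError("none of the keywords occur in the reference")
    found = sum(min(ref_counts[k], hyp_counts[k]) for k in wanted)
    return found / total


# Made-up example: one wrong word out of four, but it is the word that matters.
ref = "call acme corp today"
hyp = "call acne corp today"
wer = word_error_rate(ref, hyp)
recall = keyword_recall(ref, hyp, ["acme"])
```

In this toy case the transcript is only one word off, yet the single high-value word is missed entirely, which is the gap a lone WER figure hides.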

There are ~4 types of audio:

Phone call - close microphone - conversational - low bandwidth audio - two way conversation - more industry specific terminology

Meetings - 2-5 people - conversational - far away mic - better bandwidth audio - more industry specific terminology

Broadcast - usually good diction - close mic - good bandwidth audio - more general terminology

Command & control (saying to your phone: "go to <this address>") - close mic, or an array of mics far away - short audio chunks, 2-10 seconds - spoken in a way that makes it easier to recognize (learned behavior) - usually a lot of widely known named entities are said

In that full aggregated lineup I bet we'd be in the 22-24% WER pack. That'd mostly be because we focus only on phone calls and meetings. We don't try to improve command & control/broadcast/podcast type audio yet. Broadcast because it's perceived as lower value, so customers tend not to pay for good recognition for it. (We do train models to make them better for specific customers/verticals, usually a 20-40% reduction in errors, but for now the buyer has to have a budget for it; there are ways to make it cheaper in the long term.) Command and control because you have to have a fleet of devices out in the field collecting data and driving use cases, and we don't have customers there yet.
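For anyone wondering how an aggregate number like that falls out of a mix of audio types, here's a small sketch. The per-domain error counts are invented purely for illustration (they are not measured results for any vendor), and the function names are hypothetical.

```python
def aggregate_wer(per_domain: dict) -> float:
    """Pool errors and reference words across domains.

    per_domain maps a domain name to (word_errors, reference_words).
    """
    total_errors = sum(errors for errors, _ in per_domain.values())
    total_words = sum(words for _, words in per_domain.values())
    if total_words == 0:
        raise ValueError("no reference words to score against")
    return total_errors / total_words


def apply_relative_reduction(wer: float, reduction: float) -> float:
    """WER after a relative error reduction, e.g. reduction=0.3 means 30% fewer errors."""
    return wer * (1.0 - reduction)


# Invented counts: strong on two domains, weaker on the other two.
counts = {
    "phone": (150, 1000),
    "meetings": (200, 1000),
    "broadcast": (300, 1000),
    "command_and_control": (270, 1000),
}
overall = aggregate_wer(counts)

# What a 20% and a 40% relative reduction would do to a 25% WER baseline.
tuned_low = apply_relative_reduction(0.25, 0.20)
tuned_high = apply_relative_reduction(0.25, 0.40)
```

The pooled figure depends heavily on how many words each domain contributes, so two vendors with the same aggregate can look very different on any one audio type.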


I guess maybe a better way to ask is which acoustic environments do you excel in?

In terms of gathering data, I'm curious how you plan to get the 15K audio hours it takes to train each of these models. The more you want to segment it (like through acoustic environment or genders), the more data you need. Do you have a cheap way of generating high quality data?

I didn't answer "Do you have a cheap way of generating high quality data?". We have good ways to do it. They're not that cheap though. It's expensive (organizationally and real $$$) to label large amounts of data no matter what.

But we do utilize our capabilities to better tackle the wild data gathering and labeling. For instance, "is every labeled minute just as valuable as any other?". Definitely not. So if you can find and select only the data you want to label, rather than indiscriminately labeling a bunch, then you can increase your overall efficacy.
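One common way to do that kind of selection (an assumption on my part, not a description of any particular pipeline) is uncertainty sampling: run the current model over unlabeled audio, then spend the labeling budget on the utterances it is least confident about. A minimal sketch, where the `Utterance` fields and the `mean_confidence` score are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    utt_id: str
    duration_s: float
    mean_confidence: float  # hypothetical: average per-word confidence from the current model


def select_for_labeling(utterances, budget_s: float):
    """Greedily pick the lowest-confidence utterances that fit in the budget."""
    chosen = []
    used_s = 0.0
    for utt in sorted(utterances, key=lambda u: u.mean_confidence):
        if used_s + utt.duration_s <= budget_s:
            chosen.append(utt)
            used_s += utt.duration_s
    return chosen


# Made-up pool of unlabeled audio and a two-minute labeling budget.
pool = [
    Utterance("a", 60.0, 0.95),
    Utterance("b", 60.0, 0.40),
    Utterance("c", 90.0, 0.55),
    Utterance("d", 30.0, 0.70),
]
picked = select_for_labeling(pool, budget_s=120.0)
picked_ids = [u.utt_id for u in picked]
```

Real selection would probably also want to balance speakers, vocabulary, and acoustic conditions so the labeled set doesn't collapse onto one kind of hard audio.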

If you're training from scratch, around 10k hours is needed to get a good model, but when you are transfer learning you don't need nearly that much (100 hours gets you a lot).

We excel in phone call and meetings settings. I.e. the typical sales/office/support environment.

Baidu trained their DeepSpeech model with 6000 hours of English to get a model similarly accurate to Google/Microsoft. It may just be the type of quick model you're using that needs 10k hours to achieve good results.

Mozilla's DeepSpeech is quite interesting, languages like Turkish can get a decently usable (~20% WER) model with just 80hrs of training data (no transfer learning, starting from a clean slate).

Yep, all good points. One thing to consider is that generalization is a big problem. It's easy to get good on a specific dataset nowadays (like 5-10% word error rate level on academic datasets), but that same model might do 40% WER on data in the wild.