| It's a metric that's hard to nail down because you are flattening such a large parameter space into one number. It also doesn't address the "I care about these five high-value (made-up) words, like product names and company names; can you recognize them?" case. There are roughly four types of audio:

Phone call
- close microphone
- conversational
- low bandwidth audio
- two way conversation
- more industry specific terminology

Meetings
- 2-5 people
- conversational
- far away mic
- better bandwidth audio
- more industry specific terminology

Broadcast
- usually good diction
- close mic
- good bandwidth audio
- more general terminology

Command&Control (saying to your phone: "go to <this address>")
- close mic, or an array of mics far away
- short audio chunks, 2-10 seconds
- spoken in a way that makes it easier to recognize (learned behavior)
- usually a lot of widely known named entities are said

In that full aggregated lineup I bet we'd be in the 22-24% WER pack, mostly because we focus only on phone calls and meetings. We don't try to improve command&control, broadcast, or podcast-type audio yet. Broadcast, because it's perceived as lower value, so customers tend not to pay for good recognition for it. (We do train models to make them better for specific customers and verticals, usually reducing errors by 20-40%, but for now the buyer has to have a budget for it; there are ways to make it cheaper in the long term.) Command and control, because you have to have a fleet of devices out in the field collecting data and driving use cases, and we don't have customers there yet. |
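
To make the "flattening into one number" point concrete, here is a minimal sketch. It assumes the metric under discussion is the usual word error rate, i.e. word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The transcripts, the audio-type labels used as dictionary keys, and the product name "zorblat" are all made up for illustration; they are not real benchmark data.

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def pooled_wer(pairs):
    """WER over a set of (reference, hypothesis) pairs: total errors / total words."""
    errors = words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return errors / words


# Hypothetical transcripts, grouped by audio type (illustration only).
samples = {
    "phone call": [("call the zorblat helpdesk now", "call the zorba helpdesk now")],
    "broadcast": [("the weather today is sunny and warm", "the weather today is sunny and warm")],
}

per_type = {name: pooled_wer(pairs) for name, pairs in samples.items()}
overall = pooled_wer([p for pairs in samples.values() for p in pairs])

# Two hypotheses with the same error count: one loses the made-up product
# name, the other only swaps a function word. WER cannot tell them apart.
lost_keyword = word_errors("call the zorblat helpdesk now", "call the zorba helpdesk now")
kept_keyword = word_errors("call the zorblat helpdesk now", "call a zorblat helpdesk now")
```

In this toy setup the single pooled figure sits between the per-type figures and moves with the mix of audio types, and `lost_keyword` equals `kept_keyword` even though only one hypothesis preserves the word the customer actually cares about.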
In terms of gathering data, I'm curious how you plan to get the 15K audio hours it takes to train each of these models. The more you want to segment it (for example by acoustic environment or gender), the more data you need. Do you have a cheap way of generating high-quality data?