by dumbfoundded
2741 days ago
Hi! Thanks for sharing. I have a few questions:
- How does your WER compare to other engines? https://medium.com/descript/which-automatic-transcription-se...
- How do you gather data?
- Where do you see your long-term differentiation? Is it the features you build on top of other engines, or is it the engine itself?
Disclaimer: I led engineering for temi.com (a competitor of yours) but am no longer affiliated with it.
There are ~4 types of audio:
Phone call
- close microphone
- conversational
- low-bandwidth audio
- two-way conversation
- more industry-specific terminology
Meetings
- 2-5 people
- conversational
- far-away mic
- better-bandwidth audio
- more industry-specific terminology
Broadcast
- usually good diction
- close mic
- good-bandwidth audio
- more general terminology
Command & Control (saying to your phone: "go to <this address>")
- close mic, or an array of mics far away
- short audio chunks, 2-10 seconds
- spoken in a way that makes it easier to recognize (learned behavior)
- usually contains a lot of widely known named entities
In that full aggregated lineup, I bet we'd be in the 22-24% WER pack. That's mostly because we focus only on phone calls and meetings; we don't try to improve command & control, broadcast, or podcast-type audio yet. Broadcast, because it's perceived as lower value, so customers tend not to pay for good recognition for it. (We do train models to make them better for specific customers/verticals, usually reducing errors by 20-40%, but for now the buyer has to have a budget for it; there are ways to make it cheaper in the long term.) Command and control, because you need a fleet of devices out in the field collecting data and driving use cases, and we don't have customers there yet.
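For anyone unfamiliar with the metric, here is a minimal sketch of how WER is typically computed: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The function names are hypothetical, and treating the "20-40%" figure as a relative reduction applied to a 22-24% baseline is an assumption made purely for illustration, not a reported result.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def apply_relative_reduction(base_wer: float, reduction: float) -> float:
    """Illustrative only: WER after a relative error reduction (0.2 = 20%)."""
    return base_wer * (1.0 - reduction)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(wer("go to this address", "go to the address"))
    # Hypothetical: a 24% baseline with a 20% relative reduction.
    print(apply_relative_reduction(0.24, 0.20))
```

Under that assumption, a 22-24% baseline with a 20-40% relative reduction would land somewhere between roughly 13% and 19% WER.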