| This is a common viewpoint. Have you used Echo/Alexa and seen what people do with it? "Alexa, make an entry on my calendar for lunch with Guillermo, Brian, and Kyle next week Wednesday at noon at Giordano's on Ohio Street in Chicago". From 10-15 feet away, often with all kinds of noise, echo, who knows what. A child mumbling French can get within range of an Echo device and do this (with varying degrees of success). Yes, a lot of that is handled on device in the audio frontend and elsewhere, but it often still bleeds through and makes the fundamental speech recognition challenging. Not to mention bring your accent/voice/speech pattern. That's firmly Whisper territory and doesn't even get into the flexible grammar, integrations, etc. with entire other stacks. Plus, many hundreds of millions of dollars and nearly a decade later, Alexa still struggles with this. |
However, wouldn't the use case you describe happen after wake word activation, with the rest of the audio stream then handed off to Whisper for transcription?
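
To make the question concrete, here is a rough sketch of the split I have in mind: a cheap detector gates the stream, and only the audio after the wake word reaches the transcriber. Everything here is hypothetical. `gate_on_wake_word`, `handle_stream`, `is_wake_word` and `transcribe` are names made up for illustration, frames are plain `bytes`, and `transcribe` merely stands in for a call into Whisper; no real wake word or Whisper API is used.

```python
from typing import Callable, Iterable, Iterator

Frame = bytes  # assumption: one chunk of raw audio per frame


def gate_on_wake_word(
    frames: Iterable[Frame],
    is_wake_word: Callable[[Frame], bool],
) -> Iterator[Frame]:
    """Yield only the frames that arrive after the first wake-word hit."""
    awake = False
    for frame in frames:
        if awake:
            yield frame
        elif is_wake_word(frame):
            awake = True  # the frame containing the wake word is dropped


def handle_stream(
    frames: Iterable[Frame],
    is_wake_word: Callable[[Frame], bool],
    transcribe: Callable[[bytes], str],
) -> str:
    """Hand the post-wake-word audio to a transcriber (e.g. Whisper)."""
    command_audio = b"".join(gate_on_wake_word(frames, is_wake_word))
    # Skip the expensive transcription step if the wake word never fired.
    return transcribe(command_audio) if command_audio else ""


if __name__ == "__main__":
    # Stand-ins: real frames would be PCM audio, not readable text.
    fake_frames = [b"noise", b"WAKE", b"make an entry ", b"on my calendar"]
    text = handle_stream(
        fake_frames,
        is_wake_word=lambda f: f == b"WAKE",
        transcribe=lambda audio: audio.decode(),
    )
    print(text)  # prints "make an entry on my calendar"
```

The sketch ignores the hard parts you mention (far-field noise, deciding where the command ends), which would sit in the frontend and in whatever replaces `transcribe`.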