by kkielhofner
968 days ago
Thanks! Yes, that's exactly what we do[0] (just like the commercial stuff). Wake word and VAD are low-resource, and even an ESP chip can handle them and stream the audio. The ESP-BOX-3 is actually our main target device for the voice hardware interface. It's the nearly infinite variability and complexity of audio, speech, grammar, language, etc. where you need the "big guns".

Another thing that seems to be getting lost on people: user expectations for voice interfaces are pretty high. If wake fails, a transcript is wrong, speech rec is slow, etc., it's easier, faster, and far less frustrating to just take your phone out of your pocket. At that point, why even have something poorly attempting to do voice?

[0] - https://heywillow.io/how-willow-works/#willow-inference-serv...
Do you see an eventual future where some notional "model-on-chip" would hard-wire something like Whisper into a dedicated, integrated low-power chip for these more demanding uses?