by dylanbfox
1719 days ago
Great question. This is technically referred to as "wake word detection". You run a really small model locally that processes a short window of audio (500 ms, for example) at a time through a lightweight CNN or RNN. The idea here is that it's just binary classification (vs. actual speech recognition). There are some open source libraries that make this relatively easy:

- https://github.com/Kitt-AI/snowboy (looks to be shut down now)
- https://github.com/cmusphinx/pocketsphinx

This avoids having to stream audio 24x7 to a cloud model, which would be super expensive. That said, AFAIK what Alexa does, for example, is send any positive wake word detection to a cloud model (bigger and more accurate) to verify the prediction of the local wake word detection model. Once you are confident the wake word was actually detected, that's when you start streaming to an accurate cloud-based transcription model like Assembly. That keeps costs to a minimum!
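To make the flow concrete, here is a rough sketch of the gating logic: score each 500 ms window locally, optionally ask a second (bigger) model to confirm, and only then start streaming. Everything here is an assumption for illustration: the 16 kHz sample rate, the names `windows`, `local_wake_score`, and `gate`, and especially the scoring function, which is a toy amplitude check standing in for a real CNN/RNN. It is not how snowboy, pocketsphinx, or Alexa implement this.

```python
from typing import Callable, Iterator, List, Sequence

SAMPLE_RATE = 16_000  # assumed sample rate in Hz
WINDOW_MS = 500       # window length, matching the 500 ms example above
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_MS // 1000


def windows(samples: Sequence[float], size: int = WINDOW_SAMPLES) -> Iterator[Sequence[float]]:
    """Yield consecutive non-overlapping windows; a trailing partial window is dropped."""
    for start in range(0, len(samples) - size + 1, size):
        yield samples[start:start + size]


def local_wake_score(window: Sequence[float]) -> float:
    """Hypothetical stand-in for the small local model.

    Returns the mean absolute amplitude clipped to [0, 1]. A real detector
    would return the classifier's probability that the wake word is present.
    """
    if not window:
        return 0.0
    return min(1.0, sum(abs(x) for x in window) / len(window))


def gate(
    samples: Sequence[float],
    local_threshold: float = 0.5,
    verify: Callable[[Sequence[float]], bool] = lambda window: True,
) -> List[int]:
    """Return the indices of windows where cloud streaming would start."""
    triggered = []
    for index, window in enumerate(windows(samples)):
        if local_wake_score(window) < local_threshold:
            continue  # rejected locally, so no audio leaves the device
        if verify(window):  # optional second opinion from a bigger model
            triggered.append(index)
    return triggered


# One quiet window, one loud window, one quiet window.
audio = [0.0] * WINDOW_SAMPLES + [0.9] * WINDOW_SAMPLES + [0.0] * WINDOW_SAMPLES
print(gate(audio))
```

The `verify` callback is where a cloud verification call would go. Only windows that already passed the cheap local check ever reach it, which is the whole point of the cost savings.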