| I’m a big fan of Whisper and whisper.cpp but doing reliable wake word detection with any kind of reasonable latency on a Raspberry Pi is likely to be a poor fit and very bad experience. The Whisper model operates on 30 sec speech chunks. Input audio has to be padded to that length. So you’re constantly going to be recording audio, padding, looking for wake word, and then activating full recording upon detection. All on padded 30 sec chunks looking back… Then there is model size and availability. Whisper base or maybe even tiny could potentially give decent results for wake word detection but I’m skeptical. Wake words can be surprisingly tricky. That’s just for wake word assuming you’re going to stream audio after, as reliably doing ASR and NLP to figure out speaker intent is far too challenging and time consuming to be done on Raspberry Pi class hardware in anything approaching response times that would be considered acceptable. Whisper does pretty well with relatively high noise/low quality speech and far-field microphones are amazing but I doubt that's enough to provide anything approaching Echo/Alexa quality in the real world. This is a carefully scripted demo[0] showing it takes a whopping 15 seconds to wake, ASR the speech, and return the result. The average person could easily take their phone out of their pocket, unlock it, look for the weather app, and read the weather in less time. This demo[1] claims "real time" but looking at the example videos it clearly isn't and the accuracy leaves a lot to be desired. This is with three threads on a Raspberry Pi 4. I just tried asking Echo what the weather is like and it was so fast I had trouble timing it - somewhere around one second. [0] - https://youtube.com/watch?v=Aor6CFkcWzU&si=EnSIkaIECMiOmarE [1] - https://github.com/ggerganov/whisper.cpp/discussions/166 |
[0] https://twitter.com/ggerganov/status/1602759833312456704