by endisneigh 1710 days ago
If you wanted to do something like "OK Google" with AssemblyAI would you have to transcribe everything and then process the substring "OK Google" on the application layer (and therefore incur all of the cost of listening constantly)?

It'd be cool if there was the ability to train a phrase locally on your own premises and then use that to begin the real transcription.

This probably wouldn't be super difficult to build, but was wondering if it was available (didn't see anything at a glance)

4 comments

Great question. This is technically referred to as "Wake Word Detection". You run a really small model locally that processes a short window of audio at a time (500ms, for example) through a lightweight CNN or RNN. The idea here is that it's just binary classification (wake word present or not) rather than actual speech recognition.
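To make the windowing idea concrete, here's a rough sketch of sliding a 500ms window over raw samples and scoring each window with a binary classifier. The 16 kHz sample rate, the 100ms hop, the 0.5 threshold, and `toy_classifier` are all assumptions for illustration; a real detector would plug a small CNN/RNN in where the toy function is.

```python
from typing import Callable, Iterator, List, Sequence

SAMPLE_RATE = 16_000  # assumed sample rate in Hz
WINDOW_MS = 500       # window length mentioned in the comment above
HOP_MS = 100          # assumed step between consecutive windows


def sliding_windows(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    window_ms: int = WINDOW_MS,
    hop_ms: int = HOP_MS,
) -> Iterator[Sequence[float]]:
    """Yield overlapping fixed-length windows of audio samples."""
    win = sample_rate * window_ms // 1000
    hop = sample_rate * hop_ms // 1000
    for start in range(0, len(samples) - win + 1, hop):
        yield samples[start:start + win]


def toy_classifier(window: Sequence[float]) -> float:
    """Hypothetical stand-in for a small CNN/RNN.

    Returns the mean absolute amplitude clipped to [0, 1], so it only
    detects "loud" windows, not an actual phrase.
    """
    return min(1.0, sum(abs(x) for x in window) / len(window))


def detect(
    samples: Sequence[float],
    classifier: Callable[[Sequence[float]], float],
    threshold: float = 0.5,
) -> List[int]:
    """Return the indices of windows whose score reaches the threshold."""
    return [
        i
        for i, window in enumerate(sliding_windows(samples))
        if classifier(window) >= threshold
    ]


# One second of silence, half a second of "signal", one second of silence.
audio = [0.0] * 16_000 + [1.0] * 8_000 + [0.0] * 16_000
hits = detect(audio, toy_classifier)
```

The point is just that the local model only ever sees a short, fixed-size chunk and answers yes/no, which is why it can stay tiny.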

There are some open source libraries that make this relatively easy:

- https://github.com/Kitt-AI/snowboy (looks to be shutdown now)
- https://github.com/cmusphinx/pocketsphinx

This avoids having to stream audio 24/7 to a cloud model, which would be super expensive. That said, AFAIK what Alexa does, for example, is send any positive wake word detection to a cloud model (that is bigger and more accurate) to verify the prediction of the local wake word detection model.

Once you're confident a wake word has actually been detected, that's when you start streaming to an accurate cloud-based transcription model like Assembly, which keeps costs to a minimum!
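Roughly, the gating flow described above might look like the sketch below. Everything here is hypothetical: `is_wake_word` stands in for the small local model, `verify` for the bigger cloud check, and `stream_chunks` for however long you keep streaming after a trigger. It doesn't call any real AssemblyAI API.

```python
from typing import Callable, Iterable, Iterator, TypeVar

Chunk = TypeVar("Chunk")


def gated_stream(
    chunks: Iterable[Chunk],
    is_wake_word: Callable[[Chunk], bool],  # hypothetical cheap local model
    verify: Callable[[Chunk], bool],        # hypothetical cloud-side check
    stream_chunks: int = 3,                 # assumed length of the follow-up
) -> Iterator[Chunk]:
    """Yield only the chunks that should go to the transcription service."""
    remaining = 0
    for chunk in chunks:
        if remaining > 0:
            # A wake word was confirmed recently: forward this chunk.
            yield chunk
            remaining -= 1
        elif is_wake_word(chunk) and verify(chunk):
            # The local model fired and the larger model agreed.
            remaining = stream_chunks
        # Otherwise the chunk is dropped and never leaves the device.


# Toy run using strings in place of audio chunks.
incoming = ["noise", "noise", "wake", "turn", "on", "lights",
            "noise", "false", "noise"]
sent = list(
    gated_stream(
        incoming,
        is_wake_word=lambda c: c in ("wake", "false"),  # local false positive
        verify=lambda c: c == "wake",                   # cloud rejects "false"
    )
)
```

In this toy run only the three chunks after the confirmed wake word get forwarded; the local false positive is rejected by the verification step and costs nothing on the transcription side.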

The search term you're looking for is "Keyword Spotting" (or "Wake Word Detection") - and that's what's implemented locally for ~embedded devices that sit and wait for something relevant to come along so that they know when to start sending data up to the mothership (or even turn on additional higher-power cores locally).

Here's an example repo that might be interesting (going by first impressions, and there are many more out there): https://github.com/vineeths96/Spoken-Keyword-Spotting

Bose used to have some pre-internet system that recognized the song you liked to play right after another song (like in a random shuffle), attempted to learn what you liked to hear, and queued up the song you were likely to skip to anyway. No idea how they pulled it off, since this must have been on hardware from 15 years ago iirc.

Ah yes, Bose uMusic. From the manual, it extracts 30 feature points from the songs to define your preference.

uMusic patent: https://patents.google.com/patent/CN1637743A/en

Further reading: http://products.bose.com/pdf/customer_service/owners/uMusic_...

This is actually a much simpler task than ASR, and you can easily train a model for it on a normal CPU.

The best do-it-yourself instructions are in a book called TinyML.

Compared to super deep transformers, you'll find that deployed wake word (WW) detectors are as simple as SVMs or 2-layer NNs.
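To give a feel for the scale, here's a sketch of a 2-layer network (one hidden layer plus a sigmoid output) in plain Python. The sizes are assumptions for illustration (40 features x 49 frames in, 64 hidden units), and the weights are random and untrained, so the output score is meaningless; it only shows how small the forward pass is.

```python
import math
import random

N_FEATURES = 40 * 49  # assumed input size: 40 features x 49 frames
N_HIDDEN = 64         # assumed hidden layer width


def init_params(n_in: int, n_hidden: int, seed: int = 0):
    """Create small random (untrained) weights for a 2-layer network."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)]
          for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.05, 0.05) for _ in range(n_hidden)]
    b2 = 0.0
    return w1, b1, w2, b2


def count_params(params) -> int:
    """Total number of weights and biases in the network."""
    w1, b1, w2, _b2 = params
    return sum(len(row) for row in w1) + len(b1) + len(w2) + 1


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def forward(params, x) -> float:
    """Hidden ReLU layer followed by a single sigmoid output unit."""
    w1, b1, w2, b2 = params
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)


params = init_params(N_FEATURES, N_HIDDEN)
features = [random.Random(1).random() for _ in range(N_FEATURES)]
score = forward(params, features)  # untrained, so not a real prediction
n_params = count_params(params)    # 125569 with the sizes assumed above
```

With these assumed sizes that's about 125k parameters, which is why training and inference are comfortable on a normal CPU.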