

by jononor 2394 days ago

If you can use a cloud API, the speech transcription route is likely the simplest. Recognizing spoken words is challenging and data-hungry when the words can be spoken by many different speakers.

But if you want to do this directly on the audio, you chop up your audio stream into fixed-length (in time) analysis windows. The length of the window should be a bit longer than the sound of interest (the word). The windows normally overlap. With 90% overlap, say, the next window is created by moving forward by 10% of the window length. This gives the model multiple "shots" at detecting the word as it passes by. This is suitable for spotting a word and giving its time to within something like 50 ms resolution.
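To make the windowing concrete, here is a rough sketch in Python with NumPy. The function name `sliding_windows`, the 0.5 s window length and the 16 kHz sample rate are my own assumptions for illustration; pick the window length from how long your word actually is.

```python
import numpy as np

def sliding_windows(samples, sample_rate, window_s=0.5, overlap=0.9):
    """Split a 1-D signal into overlapping fixed-length windows.

    Returns (windows, start_times), where windows has shape
    (n_windows, window_len) and start_times is in seconds.
    window_s and overlap are illustrative defaults, not recommendations.
    """
    window_len = int(round(window_s * sample_rate))
    # With 90% overlap, each new window starts 10% of a window later.
    hop_len = max(1, int(round(window_len * (1.0 - overlap))))
    if len(samples) < window_len:
        # Not enough audio for even one full window.
        return np.empty((0, window_len)), np.empty(0)
    starts = np.arange(0, len(samples) - window_len + 1, hop_len)
    windows = np.stack([samples[s:s + window_len] for s in starts])
    return windows, starts / sample_rate

# Two seconds of silence at an assumed 16 kHz sample rate.
sample_rate = 16000
audio = np.zeros(2 * sample_rate)
windows, start_times = sliding_windows(audio, sample_rate)
```

With these numbers the hop between windows is 0.5 s x 10% = 50 ms, which is where a time resolution of that order comes from.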

For each analysis window you apply feature pre-processing and a model such as the one shown in the article.
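The per-window loop could look roughly like the sketch below. Everything in it is a placeholder: `log_spectrum` is a simple stand-in feature (in practice you would use whatever pre-processing the article's model expects), and `toy_model` is a hypothetical stand-in for a trained classifier, not the model from the article.

```python
import numpy as np

def log_spectrum(window):
    """Placeholder feature: log-magnitude spectrum of a Hann-tapered window."""
    tapered = window * np.hanning(len(window))
    return np.log(np.abs(np.fft.rfft(tapered)) + 1e-10)

def spot_keyword(windows, start_times, featurize, predict, threshold=0.5):
    """Score every window; return start times whose score reaches threshold."""
    scores = np.array([predict(featurize(w)) for w in windows])
    return start_times[scores >= threshold], scores

def toy_model(features):
    """Hypothetical stand-in for a trained model: 1.0 if any bin is loud."""
    return float(features.max() > 0.0)

# Three tiny windows: silence, a tone, silence (synthetic data for illustration).
windows = np.zeros((3, 400))
windows[1] = np.sin(2 * np.pi * 50 * np.arange(400) / 400)
start_times = np.array([0.0, 0.05, 0.1])

hits, scores = spot_keyword(windows, start_times, log_spectrum, toy_model)
```

Here only the middle window scores above the threshold, so `hits` contains its start time. Because neighbouring windows overlap, a real model will usually fire on several consecutive windows for one spoken word, so you would typically merge adjacent hits into a single detection.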

This task sounds like what is called Keyword Spotting in the academic literature, which can be seen as a specific version of Audio Event Detection, applied to spoken words.