| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by olsgaard 3283 days ago

About identifying "The Big Question", I have a story from my days as a graduate student, where I failed to do so.

I was asked to help on a project that needed to identify humans in an audio stream. During my literature review, I came across the field of "Voice Activity Detection" or VAD, which concerns itself with identifying where in an audiosignal a human voice / speech is present (as opposed to what the speech is).

I implemented several algorithms from the literature and tested it on the primary tests sets referenced in papers and spend a few months on this until I finally asked myself "What would happen if I gave my algorithm an audiostream of a dog barking?"

The barking was identified as "voice".

As it turns out, the "Big Question" in Voice Activity Detection is not to find human voices (or any voices), but to figure out when to pass on high-fidelity signals from phone calls. So the algorithms tend to only care about audio segments that are background noise and segments that are not background noise.