by r2_pilot
698 days ago
Thank you for your insight. It confirms some of my suspicions from working in this area (you wouldn't happen to know anybody who makes anything more modern than the Respeaker 4-mic array?). My biggest problem is that even with AEC, the voice output triggers the VAD, so it continually thinks it's being interrupted by a human. My next attempt will be to signal true VAD only if there's also sound coming from somewhere other than behind, where the speaker is. It's been an interesting challenge so far though.
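A rough sketch of that direction-gating idea, in case it helps anyone following along. Everything here is an assumption for illustration: the function names are made up, the direction-of-arrival estimate is taken to be a bearing in degrees supplied by the mic array, the loudspeaker is assumed to sit at 180 degrees, and the 45 degree exclusion half-width is an arbitrary starting point to tune.

```python
def angular_distance_deg(a: float, b: float) -> float:
    """Smallest absolute angle between two bearings, in degrees (0..180)."""
    diff = (a - b) % 360.0
    return min(diff, 360.0 - diff)


def gated_vad(
    vad_active: bool,
    doa_deg: float,
    speaker_bearing_deg: float = 180.0,       # assumed: loudspeaker is directly behind
    exclusion_half_width_deg: float = 45.0,   # assumed: tune for the room and array
) -> bool:
    """Report speech only when the VAD fires AND the sound does not
    arrive from the cone around the loudspeaker."""
    if not vad_active:
        return False
    from_speaker = (
        angular_distance_deg(doa_deg, speaker_bearing_deg)
        <= exclusion_half_width_deg
    )
    return not from_speaker


if __name__ == "__main__":
    print(gated_vad(True, 10.0))    # sound from the front
    print(gated_vad(True, 170.0))   # sound from near the loudspeaker
```

This only helps if the direction estimate is stable while the device itself is talking; reflections off walls can make the loudspeaker's output appear to come from other bearings, so it would likely need smoothing over several frames.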
I'm really glad you saw this. So, so, so much time and hope was wasted there on the Nth team of XX people saying "how hard can it be? given physics and a lil ML, we can do $X", and inevitably reality was far more complicated, and it's important to me to talk about it so other people get a sense it's not them, it's the problem. Even unlimited resources and your Nth fresh try can fail.
FWIW my mind's been grinding on how I'd get my little Silero x Whisper on-device gAssistant replica pulling off something akin to the gpt4o demo. I keep coming back to speaker ID: replace Silero with some of the newer models I'm seeing hit ONNX. Super handwave-y, but I can't help thinking this does an end-around both AEC being shit on presumably most non-Apple devices, and the poor interactions from trying to juggle two things that operate differently (VAD and AEC). """Just""" detect when there are >= 2 simultaneous speakers with > 20% confidence. Of course, tons of bits are missing from there; ideally you'd be resilient to e.g. a TV in the background. Sigh. Tough problems.
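To make the handwave slightly more concrete, here's a minimal sketch of just the decision step. It assumes some model (not specified here, purely hypothetical) already gives a per-frame probability that two or more people are speaking at once. The 20% threshold is the number from above; the requirement of a few consecutive frames is my own addition to avoid firing on a single noisy frame, and the function name is made up.

```python
from typing import Iterable, Optional


def first_overlap_frame(
    overlap_probs: Iterable[float],
    threshold: float = 0.20,     # the "> 20% confidence" figure
    min_consecutive: int = 3,    # assumed debounce length, not from any model's docs
) -> Optional[int]:
    """Return the index of the first frame that starts a run of
    `min_consecutive` frames whose overlap probability exceeds
    `threshold`, or None if no such run exists."""
    run = 0
    for i, p in enumerate(overlap_probs):
        if p > threshold:
            run += 1
            if run >= min_consecutive:
                return i - min_consecutive + 1
        else:
            run = 0
    return None


if __name__ == "__main__":
    probs = [0.05, 0.30, 0.10, 0.25, 0.40, 0.50, 0.10]
    print(first_overlap_frame(probs))
```

This says nothing about telling a human interrupter apart from a TV in the background; that would need actual speaker identity on top of the overlap signal.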