by arthurcolle 450 days ago
Is there some way to do a simple fingerprint or something so that the AI recognizes when it was the one speaking? Or do you really just have to use WebRTC? I spoke with someone yesterday who told me WebRTC fixed this, so I'm just curious.

I wrote a "simple" (ugly) Acoustic Echo Cancellation module that kind of worked, but I'm wondering if anyone has a solution that works over the WebSockets Realtime API.

2 comments

What you're looking for is speaker embeddings. It's an embedding calculated from an audio snippet. As the other commenter mentioned, it should be combined with a robust voice isolation system.

My own system automatically detects new speakers and tries to pick up on cues to identify the speaker, and once they are identified by name, the corresponding average embedding is inserted into a vector database so that the agent can later use the embedding for simple authentication, ignoring chatter in noisy public spaces, RAG context loading, etc. It works pretty well!
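To make the enroll-then-match idea above concrete, here is a minimal sketch in plain Python. It is not the actual system described in this comment: the `SpeakerStore` class, the in-memory dict standing in for a vector database, the toy 3-dimensional vectors, and the 0.75 threshold are all assumptions for illustration. Real embeddings would come from a speaker-embedding model and have hundreds of dimensions, and the threshold would need tuning on real audio.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_embedding(embeddings):
    """Element-wise mean of several embeddings from the same speaker."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


class SpeakerStore:
    """Hypothetical in-memory stand-in for a vector database of speakers."""

    def __init__(self, threshold=0.75):
        # Assumed threshold; tune it against real recordings.
        self.threshold = threshold
        self.profiles = {}

    def enroll(self, name, embeddings):
        # Store one averaged embedding per identified speaker.
        self.profiles[name] = average_embedding(embeddings)

    def identify(self, embedding):
        # Return (name, score) for the closest profile, or (None, score)
        # when nothing is similar enough to count as a match.
        best_name, best_score = None, -1.0
        for name, profile in self.profiles.items():
            score = cosine_similarity(embedding, profile)
            if score > best_score:
                best_name, best_score = name, score
        if best_score >= self.threshold:
            return best_name, best_score
        return None, best_score


# Toy vectors standing in for real model output.
store = SpeakerStore()
store.enroll("agent", [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]])
store.enroll("alice", [[0.0, 1.0, 0.2], [0.1, 0.9, 0.1]])

who, _ = store.identify([0.95, 0.05, 0.05])    # close to the agent profile
stranger, _ = store.identify([0.0, 0.1, 1.0])  # close to nobody enrolled
```

For the original question, enrolling the agent's own synthesized voice as one of the profiles would let you drop incoming audio that matches it, though that is no substitute for echo cancellation when the agent and a user speak at the same time.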

Does this work well for multi-user scenarios? As a side effect, I also wanted to tag and label people, but I'm not really used to working with audio. I just found "Speaker Verification with xvector embeddings on Voxceleb", which seems interesting and useful.
Within constraints, yes, it does, but I think there are many improvements I could still make. Speaker diarization and identification are ongoing subjects of research and right now there's not a good end-to-end model, so if your constraints are local inference only or low latency, it can be harder to get amazing results with current hardware and off-the-shelf models. It's still a lot better than nothing.
The hard part is separating background voices (e.g. TV, chatter, etc.) from the primary speaker's voice. Basically, you need voice isolation. Voice fingerprinting would help only in this context.