by soulofmischief 450 days ago
What you're looking for is speaker embeddings. A speaker embedding is a fixed-length vector computed from an audio snippet, so that snippets from the same voice end up close together. As the other commenter mentioned, it should be combined with a robust voice isolation system.

My own system automatically detects new speakers and tries to pick up on cues to identify them. Once a speaker is identified by name, their average embedding is inserted into a vector database so that the agent can later use it for simple authentication, ignoring chatter in noisy public spaces, RAG context loading, etc. It works pretty well!
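For anyone who wants to see the shape of the enroll-then-match idea, here is a minimal sketch. It is not the code from the system described above. It assumes the embeddings already come from some speaker-embedding model (not shown), uses a plain dict as a stand-in for the vector database, and picks cosine similarity with a threshold of 0.75 purely for illustration. `SpeakerStore`, `enroll`, and `identify` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SpeakerStore:
    """Hypothetical in-memory stand-in for a vector database of speakers."""

    def __init__(self, threshold=0.75):
        # Assumed value; tune it for whatever embedding model is in use.
        self.threshold = threshold
        self._mean = {}   # name -> average embedding
        self._count = {}  # name -> number of snippets averaged so far

    def enroll(self, name, embedding):
        """Fold a new snippet embedding into the speaker's running average."""
        if name not in self._mean:
            self._mean[name] = list(embedding)
            self._count[name] = 1
            return
        n = self._count[name] + 1
        self._mean[name] = [
            m + (x - m) / n for m, x in zip(self._mean[name], embedding)
        ]
        self._count[name] = n

    def identify(self, embedding):
        """Return (name, score) for the best match, or (None, score)
        when no enrolled speaker reaches the threshold."""
        best_name, best_score = None, -1.0
        for name, mean in self._mean.items():
            score = cosine_similarity(mean, embedding)
            if score > best_score:
                best_name, best_score = name, score
        if best_score < self.threshold:
            return None, best_score
        return best_name, best_score
```

A `(None, score)` result is the hook for the "new speaker" case: the caller can hold on to the unmatched embedding and call `enroll` once a name is learned. In a real setup the dict would be replaced by a nearest-neighbor query against the vector database.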


Does this work well for multi-user scenarios? As a side effect, I also wanted to tag and label people, but I'm not really used to working with audio. I just found "Speaker Verification with xvector embeddings on Voxceleb", which seems interesting and useful.

Within constraints, yes, it does, but I think there are many improvements I could still make. Speaker diarization and identification are ongoing subjects of research, and right now there isn't a good end-to-end model. If your constraints are local-only inference or low latency, it can be harder to get amazing results with current hardware and off-the-shelf models. It's still a lot better than nothing.