by sjnair96 1366 days ago
> you can also get speaker labels with existing open source software.

Hello Nickolay :)

Diarization has always been the hard part for me, especially since it is very difficult to compare systems on data from your own domain. The evaluation metrics are not descriptive enough, imo.
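To make the metrics complaint concrete, here is a rough sketch of a frame-level diarization error rate (DER) that also reports the breakdown into false alarms, misses, and speaker confusion. It is a simplification under my own assumptions: fixed-size frames, one speaker per frame (no overlap), no forgiveness collar, and a brute-force search over label mappings that only works for a handful of speakers. The function name `frame_der` and the toy labels are made up for illustration.

```python
from itertools import permutations

def frame_der(reference, hypothesis):
    """Simplified frame-level DER.

    reference / hypothesis: equal-length lists of per-frame speaker labels,
    with None meaning non-speech. Returns a dict with the error breakdown.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    speech_frames = sum(r is not None for r in reference)
    if speech_frames == 0:
        raise ValueError("reference contains no speech frames")

    ref_speakers = sorted({r for r in reference if r is not None})
    hyp_speakers = sorted({h for h in hypothesis if h is not None})
    # Pad with None so surplus hypothesis speakers can stay unmatched.
    padded = ref_speakers + [None] * max(0, len(hyp_speakers) - len(ref_speakers))

    best = None
    # Try every hypothesis -> reference label mapping, keep the cheapest one.
    for perm in permutations(padded, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        false_alarm = miss = confusion = 0
        for r, h in zip(reference, hypothesis):
            if r is None and h is not None:
                false_alarm += 1
            elif r is not None and h is None:
                miss += 1
            elif r is not None and mapping[h] != r:
                confusion += 1
        total = false_alarm + miss + confusion
        if best is None or total < best["errors"]:
            best = {
                "errors": total,
                "false_alarm": false_alarm,
                "miss": miss,
                "confusion": confusion,
                "der": total / speech_frames,
            }
    return best

# Toy example: 8 frames, two reference speakers, two hypothesis speakers.
ref = ["A", "A", "A", "B", "B", None, "B", "B"]
hyp = ["x", "x", "y", "y", "y", "y", None, "y"]
result = frame_der(ref, hyp)
print(result)
```

The single `der` number hides which kind of error dominates, which is part of why I find it hard to judge a system on my own data from the headline figure alone.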

Would you say TitaNet or ECAPA-TDNN are decent for production use alongside Whisper (or any other ASR output), given the timestamps, so as to bypass running VAD? I'm just about to run experiments with pyannote's diarization model and Google's UIS-RNN to see how well they work, but evaluating them is a tad beyond my ability.
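Roughly what I have in mind, as a sketch only: take the ASR segments as given (so no separate VAD pass), compute one speaker embedding per segment, and group segments by cosine similarity. The embeddings below are made-up toy vectors standing in for whatever a model like TitaNet or ECAPA-TDNN would produce; the `label_segments` helper, the greedy clustering, and the 0.7 threshold are my own placeholders, not anything from those projects.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def label_segments(segments, embeddings, threshold=0.7):
    """Greedy online clustering of ASR segments into speakers.

    segments:   list of (start_sec, end_sec, text) from the ASR output
    embeddings: one speaker embedding per segment (same order)
    threshold:  minimum cosine similarity to join an existing speaker
    """
    centroids = []  # running sum of embeddings per speaker
    labeled = []
    for (start, end, text), emb in zip(segments, embeddings):
        best_idx, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is None:
            # Nothing similar enough: open a new speaker.
            centroids.append(list(emb))
            best_idx = len(centroids) - 1
        else:
            # Fold the segment into the matched speaker's running sum.
            centroids[best_idx] = [c + e for c, e in zip(centroids[best_idx], emb)]
        labeled.append((start, end, f"SPEAKER_{best_idx}", text))
    return labeled

# Toy data: three ASR segments with hand-written 3-d "embeddings".
segments = [(0.0, 2.1, "hello"), (2.1, 4.0, "hi there"), (4.0, 6.5, "how are you")]
embeddings = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.9, 0.1, 0.1]]
labeled = label_segments(segments, embeddings)
for row in labeled:
    print(row)
```

The obvious weakness is that an ASR segment can contain a speaker change, in which case a single embedding per segment will blur two voices together, so this is only as good as the segment boundaries.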

I also wonder if the Whisper architecture would be good for generating speaker embeddings, but I feel it's focused so much on what is said rather than how it's said that it might not transfer well to speaker tasks.