by jeffharris 461 days ago
We're thinking about diarization (adding time awareness to GPT models) but no firm plans to share just yet

Jeff, you know what would be magical? Not just vanilla diarization with "Speaker 1" and "Speaker 2", but if the model could tell from the conversation that a speaker was referred to as "Jeff Harris" or "Jeff" and use that name instead.
Or if we could even provide samples of what an example speaker sounds like in general so that it would always classify them the way we want.
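For what it's worth, the "provide samples of a speaker" idea is usually framed as comparing voice embeddings. Here is a rough sketch of that matching step, assuming you already have one embedding per anonymous speaker and one per enrolled sample from some speaker-embedding model. The function names, the 0.7 threshold, and the toy 3-dimensional vectors are all made up for illustration; this is not any particular product's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def name_speakers(cluster_embeddings, enrolled, threshold=0.7):
    """Map each anonymous label to the closest enrolled name.

    A label is kept unchanged when no enrolled sample reaches the
    (hypothetical) similarity threshold.
    """
    names = {}
    for label, emb in cluster_embeddings.items():
        best_name, best_score = None, threshold
        for name, ref in enrolled.items():
            score = cosine(emb, ref)
            if score >= best_score:
                best_name, best_score = name, score
        names[label] = best_name or label
    return names

# Toy vectors standing in for real voice embeddings (assumption).
enrolled = {"Jeff": [1.0, 0.0, 0.0], "Alex": [0.0, 1.0, 0.0]}
clusters = {"Speaker 1": [0.9, 0.1, 0.0], "Speaker 2": [0.0, 0.0, 1.0]}
labels = name_speakers(clusters, enrolled)
print(labels)
```

In this toy case "Speaker 1" is close to the enrolled "Jeff" sample and gets renamed, while "Speaker 2" matches nobody and keeps its generic label.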
The feature I want is speaker differentiation - I want to feed in an audio file and get back a transcript with "Speaker 1: ..., Speaker 2: ..." indications.

That plus timestamps would be incredible.
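To make the wished-for output concrete, here is a minimal sketch of the formatting step only. It assumes some upstream tool has already produced (start, end, raw label, text) tuples; the tuple layout, the raw labels "A" and "B", and the sample sentences are invented for illustration.

```python
def fmt(seconds):
    """Format a time in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def render(segments):
    """Render segments as 'Speaker N: ...' lines with timestamps.

    Raw labels are renumbered in order of first appearance.
    """
    numbers = {}
    lines = []
    for start, end, raw_label, text in segments:
        n = numbers.setdefault(raw_label, len(numbers) + 1)
        lines.append(f"[{fmt(start)} - {fmt(end)}] Speaker {n}: {text}")
    return "\n".join(lines)

# Hypothetical segments, as if returned by a diarizing transcriber.
segments = [
    (0.0, 2.5, "A", "Hi, thanks for joining."),
    (2.5, 4.0, "B", "Happy to be here."),
    (4.0, 65.25, "A", "Let's get started."),
]
transcript = render(segments)
print(transcript)
```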

The Google Gemini 2.0 models are showing some promise with this, though I can't speak to their reliability just yet.

I've had good results in the past with pyannote and the following model for that use case: https://huggingface.co/pyannote/speaker-diarization-3.1
I thought Deepgram already did speaker diarization (which is differentiation) pretty well. It can also include timestamps plus other metadata.
WhisperX does all of this; I use it all the time to transcribe meeting notes. It handles both speaker differentiation and individual word timestamps.
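For anyone curious how word timestamps and speaker turns can be combined in general, here is a rough sketch that gives each word to the speaker turn it overlaps most. This illustrates the general idea only and is not WhisperX's actual code; the tuple layouts and sample data are assumptions.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, turns):
    """Label each (start, end, word) with the turn it overlaps most.

    A word overlapping no turn falls back to the first turn, which a
    real pipeline would want to handle more carefully.
    """
    labeled = []
    for start, end, word in words:
        best = max(turns, key=lambda t: overlap(start, end, t[0], t[1]))
        labeled.append((best[2], word))
    return labeled

def group(labeled):
    """Merge consecutive words from the same speaker into one line."""
    lines = []
    for speaker, word in labeled:
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(word)
        else:
            lines.append((speaker, [word]))
    return [f"{s}: {' '.join(ws)}" for s, ws in lines]

# Hypothetical word timestamps and diarization turns.
words = [(0.0, 0.4, "Hello"), (0.5, 0.9, "there"),
         (1.1, 1.5, "Hi"), (1.6, 2.0, "Jeff")]
turns = [(0.0, 1.0, "Speaker 1"), (1.0, 2.2, "Speaker 2")]
result = group(assign_speakers(words, turns))
print("\n".join(result))
```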