Jeff you know what would be magical? Not just vanilla diarization "Speaker 1" and "2" but if the model can know from the conversation this speaker was referred to as "Jeff Harris" or "Jeff" so it uses that instead.
The feature I want is speaker differentiation - I want to feed in an audio file and get back a transcript with "Speaker 1: ..., Speaker 2: ..." indications.
That plus timestamps would be incredible.
The Google Gemini 2.0 models are showing some promise with this, I can't speak to their reliability just yet though.