| Whisper works ... kinda. I'm hoping another set of models gets released at some point. The error rate isn't appalling to me because I'm transcribing TV shows and radio shows for personal use, so it's not mission critical. There are a few Whisper diarization "projects", but I've never been able to get any of them to work. Whisper does have word-level timestamps, so it should be simple to "plug in" diarization. I don't need an LLM or whatever this project has, but I will see if it's runnable and if it's any better than what a couple of podcasts I listen to use. Edit: I see some people mentioning whisperx, which is one of those things that was cool until moving fast broke things: >As of Oct 11, 2023, there is a known issue regarding slow performance with pyannote/Speaker-Diarization-3.0 in whisperX. It is due to dependency conflicts between faster-whisper and pyannote-audio 3.0.0. Please see this issue for more details and potential workarounds. Which means that what I gain is a ~3x increase in large-v2 speed, but I instantly lose those gains with diarization unless I track down 8-month-old bug workarounds. I'll stick with the Python venv Whisper install I've been using for the last 16 months, tyvm. |
https://github.com/MahmoudAshraf97/whisper-diarization
I remember hitting the usual Python package hell when NeMo was updated at some point, but it seems to be decently well maintained, so give it a go.
*Edit: I remember reading somewhere that pyannote was a weak link in other repos, which might be why your other tests were not great.
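On the "plug in" point upthread: if you already have word-level timestamps from Whisper and speaker turns from whatever diarizer you manage to get running, the glue step is roughly the sketch below. This is not how the linked repo or whisperx does it, just the basic max-overlap idea. The `Word` and `Turn` types and the `SPEAKER_00` style labels are made up for illustration, so you would need to map your own transcription and diarization output into them.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Turn:
    speaker: str  # hypothetical label, e.g. "SPEAKER_00"
    start: float
    end: float


def overlap(word: Word, turn: Turn) -> float:
    """Seconds during which the word and the speaker turn coincide."""
    return max(0.0, min(word.end, turn.end) - max(word.start, turn.start))


def gap(word: Word, turn: Turn) -> float:
    """Distance from the word's midpoint to the turn (0 if inside it)."""
    mid = (word.start + word.end) / 2
    if mid < turn.start:
        return turn.start - mid
    if mid > turn.end:
        return mid - turn.end
    return 0.0


def assign_speakers(words, turns):
    """Label each word with the turn it overlaps most, else the nearest turn."""
    labeled = []
    for w in words:
        if not turns:
            labeled.append(("UNKNOWN", w.text))
            continue
        best = max(turns, key=lambda t: overlap(w, t))
        if overlap(w, best) == 0.0:
            # Word fell into a silence between turns: use the closest one.
            best = min(turns, key=lambda t: gap(w, t))
        labeled.append((best.speaker, w.text))
    return labeled


def group_lines(labeled):
    """Merge consecutive words from the same speaker into one line."""
    lines = []
    for speaker, text in labeled:
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text.strip())
        else:
            lines.append((speaker, [text.strip()]))
    return [f"{speaker}: {' '.join(parts)}" for speaker, parts in lines]
```

Feed it the flattened word list and the diarizer's turns, then print `group_lines(assign_speakers(words, turns))`. Overlapping speech and bad word boundaries will still trip it up, which is probably where the real projects spend most of their effort.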