I run a small Dwarf Fortress podcast, and I didn't like the transcription options when we started a few years ago, so I wrote some Python glue to do diarization (separating out the speakers) and transcription using a torchaudio project, and either Whisper or OpenAI depending on how I'm feeling that day. Works surprisingly well, with timestamps and clean-up: