If you want a quick and free web transcription and editor tool, We've built https://revoldiv.com/ with speaker detection and timestamps. Takes less than a minute to transcribe 1 hour long video/audio
Good point but the problem with local hosting is that if you want to use the larger models it will take a long time to transcribe a file. We use multiple gpus and we do speaker detection, sound detection and it is has a rich audio editor.
Totally agree, having built a similar app I know speaker diarization is a killer feature that's hard to get. My problem is I'll never share these recordings ;).