by thecleaner 485 days ago
> Speech to text using whisper is almost perfect.

This isn't true. On benchmarks, Whisper is not SOTA. It is said to be noise-resistant, but it doesn't compare well with Conformer-based architectures, even on Librispeech mixed. It's definitely not perfect, and it doesn't work for medical transcription.

1 comment

If anything other than a small minority of people needed medical transcripts, sure! But for the remaining 99% of use cases, a fast, easy-to-deploy model is what's most useful.
It's not fast without pre-segmentation, as they do in WhisperX. It actually has terrible transcription speed. For a speedup we have to use CTranslate2 kernels. The decoding code is also a mess, which makes it hard to plug in your own custom language model. Not to mention that streaming ASR requires even more tweaks. Whisper Small is very fast and quite inaccurate. If you deploy Whisper on a GPU that costs around a dollar per hour, you really need to ensure that the cost savings are worth it.

That said, all of this is from a production lens. For personal use, honestly nothing is as easy to use as Whisper (it even works on a laptop).
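
To make the GPU cost point concrete, here is a rough back-of-the-envelope sketch. Only the roughly one dollar per hour GPU price comes from the comment above; the real-time factor, the utilization, and the function names are hypothetical placeholders, so plug in your own measurements before drawing conclusions.

```python
def cost_per_audio_hour(gpu_dollars_per_hour: float,
                        realtime_factor: float,
                        utilization: float = 1.0) -> float:
    """Estimate the dollar cost of transcribing one hour of audio.

    realtime_factor: seconds of compute per second of audio (lower is faster).
    utilization: fraction of paid GPU time spent transcribing (0 < u <= 1).
    """
    if gpu_dollars_per_hour < 0:
        raise ValueError("gpu_dollars_per_hour must be non-negative")
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # GPU hours needed per audio hour, scaled up for idle time you still pay for.
    return gpu_dollars_per_hour * realtime_factor / utilization


def self_hosting_is_cheaper(gpu_dollars_per_hour: float,
                            realtime_factor: float,
                            utilization: float,
                            alternative_dollars_per_audio_hour: float) -> bool:
    """Compare the self-hosted estimate against some alternative price."""
    own = cost_per_audio_hour(gpu_dollars_per_hour, realtime_factor, utilization)
    return own < alternative_dollars_per_audio_hour


if __name__ == "__main__":
    # Hypothetical numbers: $1/hr GPU, 10x faster than real time, GPU busy half the time.
    estimate = cost_per_audio_hour(1.0, realtime_factor=0.1, utilization=0.5)
    print(f"Estimated cost per audio hour: ${estimate:.2f}")
```

The utilization term is the part that tends to hurt in production: a GPU that sits idle most of the day still bills by the hour, so the per-audio-hour cost can be far higher than the raw speed suggests.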