by opt1c 949 days ago
> It’s “underused” precisely because this feature is pretty much useless if you’re transcribing anything other than quick snippets of speech.

I'm not sure why you're so dismissive when real-time transcription is an important use-case that falls under that bucket of "quick snippets".

> It’s also hard to use since you have to know in advance what hard-to-transcribe words are going to be in the audio.

I think it's more context-dependent than it is "hard". It's ideal for streaming meeting transcripts. In my use-cases, I use the prompt to feed in participant names, company terms/names, and other words that are likely to come up. It's also much easier to just rattle off a list of words you expect in the audio that are hard to transcribe or have unusual spellings.
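To make that concrete, here is a minimal sketch of building such a prompt from a participant list and a term list. The helper name `build_initial_prompt`, the "Glossary:" framing, and the `max_words` budget are my own assumptions for illustration; Whisper's prompt limit is counted in tokens, not words, so the word budget is only a crude stand-in.

```python
def build_initial_prompt(names, terms, max_words=150):
    """Join expected vocabulary into one prompt string.

    Hypothetical helper: de-duplicates case-insensitively and stops
    adding entries once a rough word budget is reached.
    """
    seen = set()
    vocab = []
    for word in list(names) + list(terms):
        cleaned = word.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            vocab.append(cleaned)

    kept = []
    used = 0
    for word in vocab:
        n = len(word.split())
        if used + n > max_words:
            break  # budget exhausted; drop the remaining entries
        kept.append(word)
        used += n

    if not kept:
        return ""
    return "Glossary: " + ", ".join(kept) + "."


# Usage with the open-source whisper package (not run here):
#   import whisper
#   prompt = build_initial_prompt(["Priya Raghavan"], ["Kubernetes"])
#   whisper.load_model("base").transcribe("meeting.wav", initial_prompt=prompt)
```

Swapping contexts then just means swapping the two lists; the model stays the same.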

> We need a better solution. It would be much better if there were an easy way to fine tune Whisper to learn new vocab.

Prompting is far easier than fine-tuning in every respect. I can reuse the same model in any context and just swap out the prompt. I don't have to spend time/money fine-tuning... I don't have to store multiple fine-tuned copies of Whisper for different contexts... I'm not sure what better solution you envision, but fine-tuning is certainly not easier than prompting.

1 comment

Real-time transcription is not necessarily short snippets. In my experience, the initial prompt is useless beyond the first 30 seconds unless the words in it are actually spoken in every 30-second window, including the first.
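One rough workaround for that decay, sketched under assumptions: cut the audio into fixed 30-second windows and pass the same prompt to every window. The function `transcribe_in_windows` and the injected `transcribe_fn(chunk, prompt)` callable are hypothetical names, not part of any library; in practice the callable would wrap whatever transcription call you use.

```python
def transcribe_in_windows(samples, transcribe_fn, prompt,
                          sample_rate=16000, window_s=30):
    """Transcribe audio window by window, re-supplying the prompt each time.

    samples:       sequence of audio samples (mono)
    transcribe_fn: hypothetical callable (chunk, prompt) -> text
    prompt:        vocabulary prompt reused for every window
    """
    window = sample_rate * window_s
    pieces = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        pieces.append(transcribe_fn(chunk, prompt))
    # Join non-empty window transcripts with single spaces.
    return " ".join(p.strip() for p in pieces if p.strip())
```

The trade-off is real: hard cuts at window boundaries can split words, and each window loses the context of the previous transcript, so this is a sketch of the idea rather than a fix.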

It may be easy to rattle off a list of words, but it doesn’t work nearly as well as it should, so what’s the point? I also never said fine-tuning would be easier than prompting. I said it would be better. It would just need to be easier than fine-tuning currently is, not easier than prompting.

The fine-tuning I’m talking about would not be limited to only a few new words. You would only need one model, like we have today. It would just be your model, one that knows all the specific words and spellings you prefer. By analogy to other machine learning models, I would expect a lightweight LoRA approach to work as well.
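For a sense of why LoRA counts as lightweight, here is a back-of-the-envelope sketch. LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in instead. The 1024 x 1024 size and rank 8 below are illustrative assumptions, not measurements from any Whisper model.

```python
def full_params(d_in, d_out):
    """Parameters in one dense weight matrix."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters in the two LoRA factors (d_out x r and r x d_in)."""
    return rank * (d_in + d_out)


def lora_fraction(d_in, d_out, rank):
    """Share of the dense matrix's parameter count that LoRA trains."""
    return lora_trainable_params(d_in, d_out, rank) / full_params(d_in, d_out)


# Illustrative numbers only: one 1024 x 1024 layer, rank 8.
print(lora_trainable_params(1024, 1024, 8))  # 16384
print(full_params(1024, 1024))               # 1048576
print(lora_fraction(1024, 1024, 8))          # 0.015625
```

Under those assumptions the adapter trains about 1.6% of the layer's parameters, which is what makes a per-user or per-domain vocabulary adapter plausible to store alongside a single base model.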

I just haven’t seen anyone working on these solutions that would actually be scalable, unlike the initial prompt.

The initial prompt works in extremely specific scenarios, but it has been so unreliable for long transcripts in my experience that I don’t bother with it anymore. Someone mentioned Alexa-style home assistants, whose audio snippets would be short enough for the initial prompt to actually be useful.