Hacker News
by corlinp 79 days ago
This is exactly the case today. Multimodal LLMs like gpt-4o-transcribe are way better than traditional ASR, not only because they understand the audio more deeply but also because you can actually prompt them with your company's specific terminology, org chart, etc.

For example, if the prompt says that Caitlin is an accountant and Kaitlyn is an engineer, then when you transcribe "Tell Kaitlyn to review my PR" it will know which one you're referring to. That's something WER doesn't really capture.
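
To make the WER point concrete, here is a minimal sketch of word error rate as word-level edit distance divided by reference length. The `wer` helper and the two hypothesis strings are illustrative assumptions, not output from any real model. It only shows that picking the wrong person scores the same as a harmless slip.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "Tell Kaitlyn to review my PR"
wrong_person = "Tell Caitlin to review my PR"   # hypothetical: routes the request to the accountant
harmless_slip = "Tell Kaitlyn to review the PR"  # hypothetical: meaning intact

# Both hypotheses have exactly one substituted word out of six.
print(wer(reference, wrong_person), wer(reference, harmless_slip))
```

Both calls give one error in six words, even though only the first one sends the request to the wrong person.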

BTW, I built an open-source Mac tool for using gpt-4o-transcribe with an OpenAI API key and custom prompts: https://github.com/corlinp/voibe
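
For anyone who wants to try the prompting idea directly, here is a rough sketch using the OpenAI Python SDK's transcription endpoint. It is not taken from voibe. The `build_prompt` and `transcribe` helpers, the roster format, and the prompt wording are all assumptions for illustration.

```python
def build_prompt(roster: dict[str, str]) -> str:
    """Turn a name -> role mapping into a short context prompt (hypothetical format)."""
    facts = [f"{name} is {role}." for name, role in roster.items()]
    return "People who may be mentioned: " + " ".join(facts)


def transcribe(path: str, roster: dict[str, str]) -> str:
    """Transcribe an audio file, passing the roster as prompt context."""
    # Imported here so build_prompt can be used without the SDK installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(
            model="gpt-4o-transcribe",
            file=audio,
            prompt=build_prompt(roster),
        )
    return result.text


roster = {"Caitlin": "an accountant", "Kaitlyn": "an engineer"}
print(build_prompt(roster))
```

Calling `transcribe("memo.wav", roster)` would send the audio along with that context, where `memo.wav` is a placeholder filename.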

1 comment

Many ASR models already support prompts or adding your own terminology. This one doesn't, but full LLMs, especially such expensive ones, aren't needed for that.
A lot of them, like Whisper, are severely limited in context size for adding your own terminology.