by alkonaut 497 days ago
Something I find annoying with automatic transcriptions and summaries, like the one built into Teams, is that they lack the context necessary to properly interpret what's being said. For example, if I have a meeting discussing products, abbreviations or systems with "internal" names, it can't discern them, or it statistically rejects them and substitutes its best guess for a dictionary word instead. So say we have a long call involving frequent mentions of a measure called pNet, pronounced "Peenet" in the meeting. Then you end up with a transcription of a bunch of guys having a discussion about penises. Hilarious, the first few times. OK, always hilarious, but not so useful.

Being able to set the system prompt for these transcriptions would be very useful. Like "You are a friendly bot transcribing meetings at a software company. Some common terms and abbreviations you'll encounter are...".

3 comments

My favourite was Kubernetes in our meeting being referred to as Cuban Eighties. ⎈
Anecdotally, if you have an accent and want to reference Maltese Falcon[1], your voice recognition software may understand it as “Maltese f* off”.

[1]: https://en.m.wikipedia.org/wiki/The_Maltese_Falcon_(1941_fil...

Perhaps these will be flagged for the CIA or DEA to investigate due to illegal importation of Cubans from the enemy!
This should be trivially solvable with a glossary as context, as you suggest. I bet the above repo would love a PR, too!
But the error happens in the 'audio to text' part, so a text prompt won't solve it. The way to fix it is probably fine-tuning the underlying audio-to-text model.
Doing audio-to-text requires having a statistical model for what word or phrase a piece of sound is most likely to be. Without context, you can't do better than ranking the most likely candidates where a common word is more likely than an uncommon one. Having a task-specific dictionary at that point would help.
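
A minimal sketch of that idea, assuming a hypothetical recognizer that returns candidate phrases with log-probabilities (the `rerank` name, the `boost` value and the probabilities are made up for illustration):

```python
import math

def rerank(candidates, glossary, boost=5.0):
    """Return the best candidate after boosting glossary terms.

    candidates: list of (text, log_probability) pairs from a
                hypothetical recognizer.
    glossary:   set of lowercase task-specific terms.
    boost:      amount added to the log-probability of any candidate
                found in the glossary (an arbitrary illustrative value).
    """
    def score(item):
        text, logp = item
        return logp + (boost if text.lower() in glossary else 0.0)

    return max(candidates, key=score)[0]

# Made-up scores: the common phrase outranks the uncommon term.
candidates = [
    ("Cuban Eighties", math.log(0.6)),
    ("Kubernetes", math.log(0.3)),
    ("cube and eddies", math.log(0.1)),
]

best_without = rerank(candidates, glossary=set())
best_with = rerank(candidates, glossary={"kubernetes"})
```

With an empty glossary the common phrase wins ("Cuban Eighties"); with "kubernetes" in the glossary the boosted term wins instead. A real recognizer would apply this kind of bias during decoding rather than on a final list, so treat this only as an illustration of the ranking argument.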

One could also imagine doing it at the summary step, where the AI could simply be asked to do phonetic analysis. "Here is a transcription of a meeting. Here is a list of terms/names/participants etc. Given the transcription, the meeting context/topics and assuming the transcriber has made errors, replace similar-sounding words and terms with more likely ones from the context"
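
The same post-correction can be roughed out without an LLM. The sketch below is only an approximation: it uses spelling similarity from the standard library's difflib as a crude stand-in for phonetic similarity, and it assumes you supply a "sounds like" spelling for each term. The function name, the threshold and the hint spellings are all assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

def _key(text):
    # Lowercase and keep letters only, so "Pee net" and "peenet" compare equal.
    return re.sub(r"[^a-z]", "", text.lower())

def correct_transcript(transcript, sounds_like, threshold=0.8, max_words=3):
    """Replace runs of words that look like a glossary term's 'sounds like' hint.

    sounds_like maps the correct term to a rough spelling of how it is heard.
    """
    words = transcript.split()
    hints = {term: _key(heard) for term, heard in sounds_like.items()}

    # Score every window of 1..max_words words against every hint.
    scored = []
    for start in range(len(words)):
        for size in range(1, max_words + 1):
            if start + size > len(words):
                break
            window = _key(" ".join(words[start:start + size]))
            if not window:
                continue
            for term, hint in hints.items():
                ratio = SequenceMatcher(None, window, hint).ratio()
                if ratio >= threshold:
                    scored.append((ratio, start, size, term))

    # Accept the best matches first and skip any that overlap an accepted one.
    taken = [False] * len(words)
    replacements = {}
    for ratio, start, size, term in sorted(scored, key=lambda m: -m[0]):
        if not any(taken[start:start + size]):
            for j in range(start, start + size):
                taken[j] = True
            replacements[start] = (size, term)

    # Rebuild the text, substituting the accepted windows.
    out, i = [], 0
    while i < len(words):
        if i in replacements:
            size, term = replacements[i]
            out.append(term)
            i += size
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

glossary = {"pNet": "peenet", "Kubernetes": "cuban eighties"}
fixed = correct_transcript(
    "the pee net value dropped after we moved to cuban eighties", glossary
)
```

Here `fixed` comes out as "the pNet value dropped after we moved to Kubernetes". The sketch drops punctuation attached to replaced words and will miss mishearings whose spelling is far from the hint, which is where an LLM with meeting context, as described above, would likely do better.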

Whisper accepts an initial prompt, which can carry a glossary like this.
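
For reference, a hedged sketch of what that could look like with the open-source openai-whisper package, where the option is called `initial_prompt`. The helper names, the prompt wording and the model size are assumptions; the prompt only biases decoding and is not guaranteed to fix every term.

```python
def build_glossary_prompt(terms):
    """Join glossary terms into a short prompt string (wording is arbitrary)."""
    return "Terms that may come up in this meeting: " + ", ".join(terms) + "."

def transcribe_with_glossary(audio_path, terms, model_name="base"):
    """Transcribe an audio file, passing the glossary as Whisper's initial prompt."""
    # Imported lazily so the prompt helper works without the package installed.
    import whisper  # assumes `pip install openai-whisper`

    model = whisper.load_model(model_name)
    result = model.transcribe(
        audio_path,
        initial_prompt=build_glossary_prompt(terms),
    )
    return result["text"]

prompt = build_glossary_prompt(["pNet", "Kubernetes"])
```

Calling `transcribe_with_glossary("meeting.wav", ["pNet", "Kubernetes"])` would then run the transcription, where "meeting.wav" is a placeholder path.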
Gong has such a feature. It’ll even expand out acronyms the first time they show up in the transcript.