by satvikpendem 1058 days ago
Can I not do the same thing with Whisper to transcribe and then pipe the data into my LLM of choice?
2 comments

Their ASR model is a Conformer trained on 1.1M hours, so the results should be better than Whisper's. Going by their pricing page, for roughly one meeting's worth of audio (a 60-minute file, 15,000 input tokens) with 2,000 output tokens (about 1,500 words) on the LeMUR default, the price estimate is $0.353, which I think is a fairly good price. This tool could save a secretary a lot of time, or even replace them. But I think sending your meeting data to a third party is still quite risky.
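For anyone who wants to scale that estimate, here is a rough sketch of the arithmetic. It uses only the figures quoted above; the function names are made up for illustration, and the linear scaling with audio length is an assumption, since the actual pricing page may tier or round differently.

```python
# Figures quoted in the comment above for one 60-minute meeting.
MEETING_MINUTES = 60
INPUT_TOKENS = 15_000
OUTPUT_TOKENS = 2_000
OUTPUT_WORDS = 1_500
QUOTED_PRICE_USD = 0.353


def implied_ratios():
    """Ratios implied by the quoted figures (not official conversion rates)."""
    return {
        "input_tokens_per_minute": INPUT_TOKENS / MEETING_MINUTES,
        "words_per_output_token": OUTPUT_WORDS / OUTPUT_TOKENS,
    }


def estimate_cost_usd(audio_minutes):
    """Assumption: cost grows linearly with audio length."""
    return QUOTED_PRICE_USD * audio_minutes / MEETING_MINUTES


if __name__ == "__main__":
    print(implied_ratios())
    print(f"100 hours: ~${estimate_cost_usd(100 * 60):.2f}")
```

Under that linear assumption, the per-hour figure is simply multiplied by the number of hours.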
The comparison is by a competitor, but it’s believable IMO. Basically about the same performance as Whisper:

- https://deepgram.com/learn/nova-speech-to-text-whisper-api

Not surprising though, as at this level all these options are starting to be leveled by inconsistencies in the manual ground truth. Conformer alone also isn’t the most powerful architecture out there for speech. This is also slower than, say, running a large k2 zipformer via ONNX on CPU.

Also, if you run a small shop, at this point you can do all of this yourself with Whisper large-v2 on a single 16 GB GPU, with some tweaking of https://github.com/guillaumekln/faster-whisper and an OSS LLM.
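A minimal sketch of that do-it-yourself route, assuming the `faster-whisper` package and a CUDA GPU are available. `chunk_transcript`, `build_prompt`, and the `max_chars` budget are hypothetical helpers for illustration, not part of any library, and the actual LLM call is left out because it depends on which OSS model and runtime you pick.

```python
def transcribe(audio_path, model_size="large-v2"):
    """Transcribe with faster-whisper; returns (start, end, text) tuples."""
    # Imported lazily so the helpers below work without the package installed.
    from faster_whisper import WhisperModel

    # float16 on GPU; other compute types trade accuracy for memory.
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path, beam_size=5)
    # `segments` is a generator; transcription runs as it is consumed.
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]


def chunk_transcript(segments, max_chars=8000):
    """Group segment texts into chunks that stay under max_chars.

    A single segment longer than max_chars still becomes its own chunk.
    """
    chunks, current, size = [], [], 0
    for _start, _end, text in segments:
        if current and size + len(text) + 1 > max_chars:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text) + 1  # +1 for the joining space
    if current:
        chunks.append(" ".join(current))
    return chunks


def build_prompt(chunk, task="Summarize this meeting excerpt."):
    """Wrap a transcript chunk in a plain instruction prompt."""
    return f"{task}\n\nTranscript:\n{chunk}"


if __name__ == "__main__":
    segs = transcribe("meeting.mp3")  # hypothetical file name
    for chunk in chunk_transcript(segs):
        prompt = build_prompt(chunk)
        # Send `prompt` to your LLM of choice here.
        print(prompt[:200])
```

Chunking by characters is a crude stand-in for token counting; swap in your model's tokenizer if you need to fit a context window exactly.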

Interesting stuff but I think margins in this space are getting ready to simply vanish.

Deepgram will correlate the text in your transcription with the timestamp where that was uttered. This is really really impressive and useful.
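To show why word-level timestamps are handy, here is a small sketch that looks up when a phrase was spoken. It assumes the transcript has already been reduced to a list of dicts with `word`, `start`, and `end` keys (times in seconds); that shape and the `find_phrase` helper are illustrative assumptions, so adapt them to whatever your ASR provider actually returns.

```python
import re


def _norm(token):
    """Lowercase and strip punctuation so "it." matches "it"."""
    return re.sub(r"[^\w']", "", token.lower())


def find_phrase(words, phrase):
    """Return (start, end) time spans where `phrase` occurs in `words`."""
    target = [t for t in (_norm(p) for p in phrase.split()) if t]
    spoken = [_norm(w["word"]) for w in words]
    hits = []
    if not target:
        return hits
    # Slide a window of len(target) words across the transcript.
    for i in range(len(spoken) - len(target) + 1):
        if spoken[i:i + len(target)] == target:
            last = words[i + len(target) - 1]
            hits.append((words[i]["start"], last["end"]))
    return hits


if __name__ == "__main__":
    words = [
        {"word": "Let's", "start": 0.0, "end": 0.4},
        {"word": "ship", "start": 0.4, "end": 0.8},
        {"word": "it.", "start": 0.8, "end": 1.0},
    ]
    print(find_phrase(words, "ship it"))
```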
I'd recommend just trying the Colab in my comment above to test how quickly you can do what you want with LeMUR versus building your own. Piping 100 hours of audio into an LLM can be a lot of work compared to an API call, but it'll depend on what you are building.