by CSSer 1202 days ago
Is there any chance you could expose a pathway to use a local instance of Whisper? I ask primarily because OpenAI completely open-sourced Whisper in September 2022[0]. It seems odd to me to default to or encourage the usage of a paid service for something that appears to be available for free under the MIT license, including the models[1].

My understanding is that the only reason OpenAI even set up the paid API is because it "can also be hard to run [sic]". Personally, I'm skeptical. I'm not knocking them for it, but I could see how this is just brand capitalization.

[0]: https://openai.com/blog/introducing-chatgpt-and-whisper-apis...

[1]: https://github.com/openai/whisper

4 comments

If you run the large-v2 model (the more accurate one, which they expose via the API) on your local machine, you'll see that even though it works great, it's slow and won't work for long audio files because of memory limitations.

It's fairly easy and quick to run Whisper for free either locally in an Anaconda environment with Python or the command-line interface or, even better, in a Google Colab notebook.
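
For the local Python and command-line routes, a minimal sketch might look like the following. It assumes the `openai-whisper` package and `ffmpeg` are installed; the helper names `transcribe_locally` and `cli_command` are made up for illustration, and the `large-v2` default is just the model discussed in this thread.

```python
from typing import List


def transcribe_locally(audio_path: str, model_name: str = "large-v2") -> str:
    """Transcribe an audio file with a locally loaded Whisper model."""
    # Imported lazily so this file can be loaded without the package installed.
    # Assumes: pip install -U openai-whisper, plus ffmpeg on PATH.
    import whisper

    model = whisper.load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(audio_path)   # dict with "text" and "segments"
    return result["text"].strip()


def cli_command(audio_path: str, model_name: str = "large-v2") -> List[str]:
    """Build the equivalent command-line invocation as an argument list."""
    return ["whisper", audio_path, "--model", model_name]
```

Calling `transcribe_locally("interview.mp3", "small")` (a hypothetical file name) is a reasonable first try on a machine without much VRAM; the larger models need a GPU to be practical.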

Here's a sample notebook that builds on a notebook by Pete Warden.

https://colab.research.google.com/drive/1sxsey3n0jd09MjUd9Ky...

On a 1080Ti (so a 6-year-old GPU), the large model runs at 1x realtime (so transcribing 10 minutes of audio takes 10 minutes), and I've successfully transcribed even 1h+ files.
FWIW an optimized implementation I've been working on comes in at roughly 70x realtime (large-v2, beam size 5) on an RTX 3090.
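
To put the realtime figures quoted above into wall-clock terms, here is a small arithmetic sketch. The function name `processing_seconds` is hypothetical, and the two figures come from different hardware and implementations, so this is not a like-for-like benchmark.

```python
def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Return the wall-clock time needed to transcribe audio at a given realtime factor.

    A factor of 1 means 10 minutes of audio takes 10 minutes;
    a factor of 70 means it takes 1/70th of that.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


ten_minutes = 10 * 60
print(processing_seconds(ten_minutes, 1))             # 600.0 seconds at 1x
print(round(processing_seconds(ten_minutes, 70), 1))  # 8.6 seconds at 70x
```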
Nice! Are you going to release it publicly?
Great question!

We're still very early-stage and in stealth, so it's not quite clear to us where our lines are with regard to special sauce/significant competitive advantage.

As the CTO (and lead dev) I'd lean towards open sourcing it (because it's awesome and we're standing on the shoulders of open source giants already) but it may become clear it's too differentiating to open source. As I said it's just too early to tell.

What I can say is if we open source it HN will be the first to hear about it!

> My understanding is that the only reason OpenAI even set up the paid API is because it "can also be hard to run [sic]". Personally, I'm skeptical. I'm not knocking them for it, but I could see how this is just brand capitalization.

Why is it hard to see that not every organization has the capability to set up its own translation cluster, provision GPUs, build frontends, handle scaling, staff on-call rotations, and regularly update models? It's not just "brand capitalization". An API that you can call to transcribe/translate a recording with zero extra work is absolutely essential for most.

I have a pipeline set up in https://github.com/cnbeining/Whisper_Notebook/blob/master/Wh... .

- Run Voice Activity Detection for better timestamp output
- Transcribe with Whisper
- Run Forced Alignment to get per word timestamp
- Create better segmented SRT
- Translate (with multiple APIs - implemented DeepL, Google Translate, Baidu and a couple more)
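
As a rough illustration of the SRT step only (not the linked notebook's actual code), here is a minimal sketch that turns timestamped segments into SRT text. It assumes each segment is a dict with `start`, `end` (in seconds), and `text` keys, which is the shape Whisper's Python API returns; the function names are hypothetical.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# Made-up segments, purely for illustration.
example = [
    {"start": 0.0, "end": 2.5, "text": " Hello"},
    {"start": 2.5, "end": 4.0, "text": " world"},
]
print(segments_to_srt(example))
```

The VAD and forced-alignment steps would mainly change where the `start` and `end` values come from; the formatting stays the same.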

The API is useful because not everyone has fast GPUs with 10+ GB of VRAM lying around.
You know, this is true. I was a bit too dismissive about it because I haven't done a lot of deploying models myself. I was making the assumption that it was similar to many other services, but even looking at pricing for managed GPUs on most instances shows me that's clearly not the case.