by nonoesp 1202 days ago
If you run the large-v2 model (the more accurate one, which they expose via the API) on your local machine, you'll see that even though it works great, it's slow and won't work for long audio files because of memory limitations.

It's fairly easy and quick to run Whisper for free, either locally (in an Anaconda environment, using Python or the command-line interface) or, even better, in a Google Colab notebook.
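For anyone who wants to try the local Python route, here's a minimal sketch. It assumes the open-source `openai-whisper` package is installed (`pip install openai-whisper`) and that ffmpeg is on your PATH. The `format_timestamp` and `transcribe` helpers are my own illustrative names, not part of the library.

```python
# Minimal sketch of local transcription with the open-source `openai-whisper` package.
# Assumptions: `pip install openai-whisper` has been run and ffmpeg is available.

def format_timestamp(seconds: float) -> str:
    """Render a number of seconds as HH:MM:SS.mmm (illustrative helper)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def transcribe(path: str, model_name: str = "large-v2") -> list[str]:
    """Transcribe one audio file and return one timestamped line per segment."""
    import whisper  # imported lazily so the helper above works without the package

    model = whisper.load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(path)         # dict with "text" and "segments"
    return [
        f"[{format_timestamp(seg['start'])} -> {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in result["segments"]
    ]
```

Calling `transcribe("audio.mp3")` (the file name is a placeholder) returns the timestamped segments. The rough command-line equivalent is `whisper audio.mp3 --model large-v2`. If large-v2 doesn't fit in memory, passing a smaller model name such as "medium" is the obvious thing to try.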

Here's a sample notebook that builds on a notebook by Pete Warden.

https://colab.research.google.com/drive/1sxsey3n0jd09MjUd9Ky...


On a 1080 Ti (so a 6-year-old GPU), the large model runs at 1x real time (so transcribing 10 minutes of audio takes 10 minutes), and I've successfully transcribed even 1h+ files.
FWIW, an optimized implementation I've been working on comes in at roughly 70x realtime (large-v2, beam size 5) on an RTX 3090.
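To put the speed figures in this thread on the same scale, here's a quick back-of-the-envelope sketch. It assumes "Nx realtime" means audio duration divided by wall-clock processing time. The function name is just illustrative.

```python
def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimate wall-clock processing time from a real-time factor.

    Assumes realtime_factor = audio duration / processing time,
    so 1x means 10 minutes of audio takes 10 minutes.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


ten_minutes = 10 * 60
print(processing_seconds(ten_minutes, 1.0))   # the 1x case
print(processing_seconds(ten_minutes, 70.0))  # the ~70x case
```

Under that reading, 10 minutes of audio takes 600 seconds at 1x and a bit under 9 seconds at roughly 70x.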
Nice! Are you going to release it publicly?
Great question!

We're still very early-stage and in stealth, so it's not quite clear to us where our lines are with regard to special sauce/significant competitive advantage.

As the CTO (and lead dev), I'd lean towards open sourcing it (because it's awesome and we're already standing on the shoulders of open source giants), but it may become clear that it's too differentiating to open source. As I said, it's just too early to tell.

What I can say is that if we open source it, HN will be the first to hear about it!