by nik_s 1366 days ago
I just tested the model [1] on an RTX 3090, translating a French text I found here [2].

Some observations:

- The full translation of the 6:22 minute video takes about 22 seconds (17x real time)

- It detects the language by default (and correctly recognized that the audio was French)

- MIT License [3]!

- The quality of the transcription is good, but not perfect.

- The quality of the translation (if you don't count transcription errors as translation errors) is generally very good.
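The "17x real time" figure is just audio length divided by processing time. A minimal sketch of that arithmetic, using the 6:22 and 22-second numbers above (the helper names `parse_duration` and `real_time_factor` are made up for illustration):

```python
def parse_duration(text: str) -> int:
    """Convert a 'M:SS' or 'H:MM:SS' string into a number of seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_seconds / processing_seconds


audio_seconds = parse_duration("6:22")  # 6 * 60 + 22 = 382
factor = real_time_factor(audio_seconds, 22)
print(f"{audio_seconds} s of audio in 22 s -> {factor:.1f}x real time")
# prints: 382 s of audio in 22 s -> 17.4x real time
```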

---

The transcription:

> Bonjour à tous, <error>j'suis</error> espère que vous allez bien, c''est ENTI. Et aujourd', <error>aujourd',</error> on se retrouve <error>un peu physique</error> pour parler de la termo dynamique. Vous ne vous inquiétez pas, ça va bien se passer. On va y aller ensemble, <error>être à par exemple,</error> je vous accompagne à travers une série de vidéos pour vous expliquer les principes de base en termo dynamique. Et bah, c''est parti, on va y aller tranquillement. Lidée, c''est vous puissiez comprendre la termo dynamique dans son ensemble. Donc, je vais vraiment prendre mon temps pour <error>couplisser</error> bien comprendre les notions,

The translation:

> Hello everyone, I hope you're doing well, it's NT and today we find ourselves a little physical to talk about the thermo dynamic. Don't worry, it's going well, we're going to go together and be the same. I'm going to accompany you through a series of videos to explain the basic principles in thermo dynamic. Well, let's go, <error>we're going to go quietly</error>. The idea is that you can understand the thermo dynamic <error>in sound together</error>. So I'm really going to take my time to understand the notions,

---

All in all, I'm very happy that OpenAI is publishing their models. If Stable Diffusion is any guide, people will hack together some crazy things with this.

[1] https://github.com/openai/whisper

[2] https://www.youtube.com/watch?v=OFLt-KL0K7Y

[3] https://github.com/openai/whisper/blob/main/LICENSE


It also runs well on a CPU and seems to have proper memory management. Wonderful timing, because I was using DeepSpeech for some audio recordings and it required me to script up a splitter to convert the files to .wav and then cut them into snippets of 10 seconds each. Everything about this just works out of the box. On a Core i5 I'm getting about 30 seconds every minute. Transcriptionist jobs just turned into editor jobs. I love how it drops the inflections in the audio as well, because it was trained on transcription work, and that is one of the first things you learn to do (drop the uhs and ums and huhs etc., unless it is a strictly verbatim transcription).
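For anyone who hasn't had to write that kind of splitter: a rough sketch of the "snippets of 10 seconds each" step using only Python's standard `wave` module. This is not the commenter's script; `split_wav` and the `chunk_NNNN.wav` naming are assumptions for illustration, and the input is assumed to already be a .wav file.

```python
import wave
from pathlib import Path


def split_wav(src, out_dir, chunk_seconds=10):
    """Cut a .wav file into consecutive chunks of at most chunk_seconds each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = reader.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break  # end of the input file
            path = out_dir / f"chunk_{index:04d}.wav"
            with wave.open(str(path), "wb") as writer:
                # Same channels, sample width and rate as the source;
                # the frame count in the header is corrected on close.
                writer.setparams(params)
                writer.writeframes(frames)
            paths.append(path)
            index += 1
    return paths
```

The last chunk is simply shorter when the recording length isn't a multiple of 10 seconds.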
> dans son ensemble

> in sound together

That's hilarious and honestly, incredibly bad. "Dans son ensemble" is a very common idiom (meaning "as a whole") while "in sound together" has to be pretty rare. "Son" means "his/her/its" as well as "sound", and the former meaning is probably more common in general, so I have no idea how this result could arise.

"Termo" also doesn't exist in French, it's "thermo", so the transcript even makes orthographic errors.

And I forgot about "couplisser", which is also a hilarious made-up word that sounds like it could mean something, but doesn't! Edit: Google finds exactly one reference to this, in a patent with a typo in the word "coulisser".

I'm still impressed by the transcript quality since it covers many languages, but the translation part is quite poor.

Was this with the `base` model? `large` is running OK on a P100 in Colab, but at about 4% of the speed of `base.en`. Certainly seems like some of these models will be fast enough for real-time.

Is it translation or transcription? Or both?

Both, wow. This is really interesting.

Both; the blog post covers it in detail. Pass in audio in any language and get an English transcript out.

It can do both - I've edited my original post to show the translation task.
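For reference, the two tasks are selected with the `task` option. A small sketch using the Python API from the repo's README; the wrapper `run_whisper`, the default model name and the file name in the usage line are placeholders, and the `whisper` import is deferred so the argument check works even where the package isn't installed.

```python
VALID_TASKS = ("transcribe", "translate")


def run_whisper(audio_path, task="transcribe", model_name="base"):
    """Return the text Whisper produces for audio_path.

    task="transcribe" keeps the spoken language;
    task="translate" produces English text instead.
    """
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    import whisper  # deferred: needs the openai/whisper package installed

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, task=task)
    return result["text"]
```

Usage would look like `run_whisper("talk.mp3", task="translate")`.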
How did you get it to use the GPU?

I have it running right now and it's not touching the GPU.

--device "cuda"
My version of PyTorch didn't have CUDA support. I had to install conda to get it, and it's installing now.

Whatever version `pip install git+https://github.com/openai/whisper.git` grabbed by default didn't include it.
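A quick way to check which kind of PyTorch build ended up installed, as a sketch (the function name `describe_torch_build` is made up; `torch.version.cuda` is `None` on CPU-only builds):

```python
def describe_torch_build() -> str:
    """Summarize whether the installed PyTorch build can use CUDA."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    cuda_build = torch.version.cuda  # None when the build has no CUDA support
    available = torch.cuda.is_available()  # also needs a working GPU and driver
    return (
        f"torch {torch.__version__}, "
        f"built for CUDA: {cuda_build}, CUDA available: {available}"
    )


print(describe_torch_build())
```

If the second value is `None`, passing `--device "cuda"` can't work until a CUDA-enabled build is installed.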

I installed Whisper (and, I thought, all the needed dependencies), and had it running on my M1 Max MacBook Pro with 64 GB RAM, but it ran TERRIBLY slowly... taking an hour to process a couple of minutes of audio...

I found this thread and wondered whether Whisper was using all the cores or the GPU, so I've spent a couple of hours trying to get Whisper to use the GPU, following the points made in this thread and googling how to install the various components via brew.

Long story short, I keep getting an error message

"RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU."

or when I set --device to gpu, I get the error: "RuntimeError: don't know how to restore data location of torch.storage._UntypedStorage (tagged with gpu)"

it's been a looong time since I wrote any code (remember BASIC?), so I realise I may be missing a lot here!!

does anyone have any pointers?

thanks!

edit: I'm now trying it one more time after trying to set the device using this line:

map_location=torch.device('gpu')

and I get this message as whisper begins: ~/opt/anaconda3/lib/python3.9/site-packages/whisper/transcribe.py:78: UserWarning: FP16 is not supported on CPU; using FP32 instead warnings.warn("FP16 is not supported on CPU; using FP32 instead")

then I wait for whisper to do its magic ...though it looks like it will remain very slow...
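A note on the errors above: `gpu` is not a device name PyTorch knows, which is what the "tagged with gpu" error is about. The names used in this thread are `cpu` and `cuda`, and an M1 has no CUDA, so the run ends up on the CPU. A hedged sketch of picking a usable `--device` value before calling the CLI; `resolve_device` and `whisper_cli_args` are hypothetical helpers, treating "gpu" as a request for CUDA is my own convention, and Apple's `mps` backend is deliberately left out because I'm not assuming Whisper supports it.

```python
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map a user-supplied device name to one Whisper's --device will accept."""
    name = requested.strip().lower()
    if name == "gpu":
        name = "cuda"  # convention of this sketch; "gpu" is not a torch device
    if name not in ("cpu", "cuda"):
        raise ValueError(f"unsupported device name: {requested!r}")
    if name == "cuda" and not cuda_available:
        return "cpu"  # no CUDA on this machine (e.g. any Mac with an M1)
    return name


def whisper_cli_args(audio_path: str, requested_device: str, cuda_available: bool):
    """Build the argument list for the whisper command-line tool."""
    device = resolve_device(requested_device, cuda_available)
    args = ["whisper", audio_path, "--device", device]
    if device == "cpu":
        # Asks for FP32 up front, so the FP16 warning is not triggered.
        args += ["--fp16", "False"]
    return args


print(whisper_cli_args("talk.mp3", "gpu", cuda_available=False))
# prints: ['whisper', 'talk.mp3', '--device', 'cpu', '--fp16', 'False']
```

In practice `cuda_available` would come from `torch.cuda.is_available()`. The `--fp16 False` flag only avoids the warning; it does not make the CPU run any faster.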