Hacker News
by cyp0633 327 days ago
The same happens with whisper-large-v3 on Chinese transcription: silence is transcribed to something like "please upvote, share and favourite this video". I suspect they trained the model on some random YouTube video without carefully picking really useful data.
11 comments

In Chinese, it always added something like "For study/research purpose only. Please delete after 48 hours." This is what those volunteers added in subtitles of (pirated) movies/shows.
Fair, if AI companies are allowed to download pirated content for "learning", why ordinary people cannot.
There is so much damning evidence that AI companies have committed absolutely shocking amounts of piracy, yet nothing is being done.

It only highlights how the world really works. If you have money you get to do whatever the fuck you want. If you're just a normal person you get to spend years in jail or worse.

Reminds me of https://www.youtube.com/watch?v=8GptobqPsvg

There's actually a lot of court activity on this topic, but the law moves slowly and is reluctant to issue injunctions where harm is not obvious.

It's more that the law about "one guy decides to pirate twelve movies to watch them at home and share with his buddies" is already well-settled, but the law about "a company pirates 10,000,000 pieces to use as training data for an AI model (a practice that the law already says is legal in an academic setting, i.e. universities do this all the time and nobody bats an eye)" is more complicated and requires additional trials to resolve. And no, even though the right answer may be self-evident to you or me, it's not settled law, and if the force of law is applied poorly suddenly what the universities are doing runs afoul of it and basically nobody wants that outcome.

What’s ironic to me is that had these companies pirated only a single work, wouldn’t that be a chargeable crime?

Clearly Bonnie and Clyde shouldn’t have been prosecuted. Imagine they were just robbing banks for literary research purposes. They could have then used the learnings to write a book and sell it commercially…

Or imagine one cracks 10000 copyrighted DVDs and then sells 30 second clips… (a derived work).

To me, for profit companies and universities have a huge difference — the latter is not seeking to directly commercially profit from copyrighted data.

There is a distinction that very few people make, but that the courts thankfully seem to grasp:

Training on copyrighted material is a separate claim from skirting payment for it.

Which pretty much boils down to: "If they put it out there for everyone to see, it's probably OK to train on it, if they put it behind a paywall and you don't pay, the training part doesn't matter, it's a violation."

Whether it’s legal slash fair use to train on copyrighted material is only one of the questions currently being asked though. There’s a separate issue at play where these companies are pirating the material for the training process.

By comparison, someone here brought up that it might be transformative fair use to write a play heavily based on Blood Meridian, but you still need to buy a copy of the book. It would still be infringement to pirate the e-book for your writing process, even if the end result was legal.

So if I download copyrighted material like the new disney movie with fansubs and watch it for training purposes instead of enjoyment purposes it's fine? In that case I've just been training myself, your honor. No, no, I'm not enjoying these TV shows.

Because it's important to grasp the scale of these copyright violations:

* They downloaded, and admitted to using, Anna's Archive: millions of books and papers, most of which are paywalled, but they pirated them instead

* They acquired Movies and TV shows and used unofficial subtitles distributed by websites such as OpenSubtitles, which are typically used for pirated media. Official releases such as DVDs tend to have official subtitles that don't sign off with "For study/research purpose only. Please delete after 48 hours" or "Subtitles by %some_username%"

If you owe the bank $1,000 you have a problem.

If you owe the bank $100,000,000 the bank has a problem.

We live in an era where the president of the United States uses his position to pump crypto scams purely for personal profit.

10% for the big don
The dead corpses of filmmakers and authors and actors are buried in unmarked graves out behind those companies' corporate headquarters. Unimaginable horror, that piracy. Why has no one intervened?

>If you're just a normal person you get to spend years in jail or worse.

Not that I'm a big fan of the criminalization of copyright infringement in the United States, but who has ever spent years in jail for this?

Besides, if it really bothered you, then we might not see this weird tone-switch from one sentence to the next, where you seem to think that piracy is shocking and "something should be done" and then "it's not good that someone should spend time in jail for it". What gives?

> who has ever spent years in jail for this?

Aaron Swartz?

EDIT: apparently he wasn't in jail, he was on bail while the case was ongoing - but the shortest plea deal would still have had him in jail for 6 months, and the penalty was 35 to 50 years.

Nope, he didn't go to jail.
> Besides, if it really bothered you, then we might not see this weird tone-switch from one sentence to the next, where you seem to think that piracy is shocking and "something should be done" and then "it's not good that someone should spend time in jail for it". What gives?

What a weirdly condescending way to interpret my post. My point boils down to: Either prosecute copyright infringement or don't. The current status quo of individuals getting their lives ruined while companies get to make billions is disgusting.

> Either prosecute copyright infringement or don't

This is the absolute core of the issue. Technical people see law as code, where context can be disregarded and all that matters is specifying the outputs for a given set of inputs.

But law doesn’t work that way, and it should not work that way. Context matters, and it needs to.

If you go down the road of “the law is the law and billion dollar companies working on product should be treated the same as individual consumers”, it follows that individuals should do SEC filings (“either require 10q’s or don’t!”), and surgeons should be jailed (“either prosecute cutting people with knives or don’t!”).

There is a lot to dislike about AI companies, and while I believe that training models is transformative, I don’t believe that maintaining libraries of pirated content is OK just because it’s an ingredient to training.

But insisting that individual piracy to enjoy entertainment without paying must be treated exactly the same as datasets for model training is the absolute weakest possible argument here. The law is not that reductive.

No one (in the US) has been jailed for downloading copyrighted material.
https://en.wikipedia.org/wiki/Aaron_Swartz

And the US is not the only jurisdiction

That's not the same as piracy though. He wasn't downloading millions of scientific papers from libgen or sci-hub, he was downloading them directly from jstor. Indeed, none of his charges were for copyright infringement. They were for stuff like "breaking and entering" and "unauthorized access to a computer network".
Aaron Swartz was not jailed or even charged for copyright infringement. The discussion and the comment I replied to is centered around US companies and jurisdiction.
There could be a moral question. For example a researcher might not want to download a pirated paper and cause loss to a fellow researcher. But it becomes pretty stupid to pay when everyone, including large reputable companies endorsed by the government, is just downloading the content for free. Maybe his research will help develop faster chips to win against China, why should he pay?

Would it be a "fair use" to download pirated papers for research instead of buying?

Also I was gradually migrating from obtaining software from questionable sources to open source software, thinking that this is going out of trend and nobody torrents apps anymore, but it seems I was wrong?

Or another example: if someone wants to make contributions to Wine but needs a copy of Windows for developing the patch, what would be the right choice, buy it or download a free copy from a questionable source?

Researchers don't get paid when their papers are downloaded, though. They pay to have their papers downloaded, and the middleman makes money on both sides. Piracy is the only moral option for them. There is a reason every single competent professor in the western world will email you a free copy of their papers if you ask nicely.
What about people filming movies in the cinema (for learning of course)? [1]

[1] https://www.thefederalcriminalattorneys.com/unauthorized-rec...

No, if you revolutionize both the practice and philosophy of computing and advance mankind to the next stage of its own intellectual evolution, you get to do whatever the fuck you want.

Seems fair.

Hm. Not a given that it's an advance.
I get the common cynical response to new tech, and the reasons for it.

We wish we lived in a world where change was reliably positive for our lives. Often changes are sold that way, but they rarely are.

But when new things introduce dramatic capabilities that former things couldn't match (every chatbot before LLMs), it is as clear of an objective technological advance as has ever happened.

--

Not every technical advance reliably or immediately makes society better.

But whether or when technology improves the human condition is far more likely to be a function of human choices than the bare technology. Outcomes are strongly dependent on the trajectories of who has a technology, when they do, and how they use it. And what would be the realistic (not wished for) outcome of not having or using it.

For instance, even something as corrosive as social media, as it is today, could have existed in strongly constructive forms instead. If society viewed private surveillance, unpermissioned collation across third parties, and weaponizing of dossiers via personalized manipulation of media, increased ad impact and addictive-type responses, as ALL being violations of human rights to privacy and freedom from coercion or manipulation. And worth legally banning.

Ergo, if we want tech to more reliably improve lives, we need to ban obviously perverse human/corporate behaviors and conflicts of interest.

(Not just shade tech. Which despite being a pervasive response, doesn't seem to improve anything.)

At the risk of stepping on a well-known land mine around here, how'd you do on the IMO problem set this year?
Except that the jury’s (at best) still out on whether the influence of LLMs and similar tech on knowledge workers is actually a net good, since it might stunt our ability to critically think and problem solve while confidently spewing hallucinations at random while model alignment is unregulated, haphazard, and (again at best) more of an art than a science.
Well, if it's no big deal, you and the other copyright maximalists who have popped out of the woodwork lately have nothing to worry about, at least in the long run. Right?
>why ordinary people cannot

They can. I don't think anyone got prosecuted for using an illegal streaming site or downloading from sci-hub, for instance. What people do get sued for is seeding, which counts as distribution. If anything AI companies are getting prosecuted more aggressively than "ordinary people", presumably because of their scale. In a recent lawsuit Anthropic won on the part about AI training on books, but lost on the part where they used pirated books.

People got in trouble for filming in the cinema as I understand, there is a separate law for that.
But in that case even though filming isn't technically distribution, it's clearly a step to distributing copies? To take this to the extreme, suppose you ripped a blu-ray, made a thousand copies, but haven't packaged or sold them yet. If the FBI busted in, you'd probably be prosecuted for "conspiracy to commit copyright infringement" at the very least.
It's just "training"
IANAL, but reading a bit on this topic: the relevant part of the copyright law for AI isn't academia, it's transformative work. The AI created by training on copyrighted material transforms the material so much that it is no longer the original protected work (collage and sampling are the analogous transformations in the visual-arts and music industries).

As for actually gathering the copyrighted material: I believe the jury hasn't even been empaneled for that yet (in the OpenAI case), but the latest ruling from the court is that copyright may have been violated in the creation of their training corpus.

AFAIK, downloading or watching pirated stuff isn't something you'll get in trouble for. Hosting and distributing it is what will get you.
Well, it just shows that they've downloaded subtitles.
Interesting, in Russian, it often ends with "Subtitles by %some_username%"
That is not the case here - I never encountered this with whisper-large-v3 or similar ASR models. Part of the reason, I guess, is that those subs are burnt into the movie, which makes them hard to extract. Standalone subs need the corresponding video resource to match the audio and text. So nothing is better than YouTube videos which are already aligned.
At least for English, those "fansubs" aren't typically burnt into the movie*, but ride along in the video container (MP4/MKV) as subtitle streams. They can typically be extracted as SRT files (plain text with sentence level timestamps).

*Although it used to be more common for AVI files in the olden days.
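For anyone who hasn't looked inside one, here is a rough sketch of how little it takes to read an SRT file once it has been extracted. The `Cue` class, `parse_srt`, the `CREDIT_HINTS` list and the sample text are all made up for illustration; real-world SRT files are messier (BOMs, odd encodings, styling tags), so treat this as a sketch rather than a robust parser.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    index: int
    start_ms: int
    end_ms: int
    text: str


# "HH:MM:SS,mmm --> HH:MM:SS,mmm" (some files use '.' instead of ',')
_TIME = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)

# Hypothetical phrases, taken from the examples mentioned in this thread.
CREDIT_HINTS = ("subtitles by", "for study/research purpose only")


def _to_ms(h, m, s, ms):
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(data: str) -> list[Cue]:
    """Parse SRT text into cues; malformed blocks are skipped."""
    cues = []
    normalized = data.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        if len(lines) < 2 or not lines[0].strip().isdigit():
            continue
        match = _TIME.search(lines[1])
        if match is None:
            continue
        g = match.groups()
        cues.append(
            Cue(int(lines[0]), _to_ms(*g[:4]), _to_ms(*g[4:]), "\n".join(lines[2:]))
        )
    return cues


def is_probable_credit(cue: Cue) -> bool:
    """Crude check for sign-off lines that are not actual dialogue."""
    lowered = cue.text.lower()
    return any(hint in lowered for hint in CREDIT_HINTS)


SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:04,000 --> 00:00:06,250
Subtitles by some_username
"""

dialogue = [c for c in parse_srt(SAMPLE) if not is_probable_credit(c)]
```

A filter like `is_probable_credit` is the kind of cleanup that would keep sign-off lines out of a text/audio pairing, though a fixed phrase list would obviously miss most variants.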

SRT is ancient. Nowadays everyone uses ASS subtitles which can be randomly styled.
In general? In the past I've known ASS to be used a lot for things like anime, but less for live action shows.
I have also found them inside mkvs as the subtitle track. I think SRT was the default because most content was ripped from DVD/BD, but now most of the content is from streaming sources and you need to convert the subtitles anyway.
WebVTT (a SubRip successor) is probably more widely used than ASS
By legit providers, probably.
flashbacks of trying to track down subs sync’d to a specific release
Indeed, with another model I would get persistent transcriptions of silent parts into 'Thanks for watching!' or '[MUSIC]'. Pretty dumb that this failure mode wasn't caught in some QA process, and there are now multiple transcription models suffering from the same issue. Having silent parts in your input audio seems like it should be a very common occurrence...
When I was taught mathematics, the zero value was always considered the most important edge case. You prove something for N=0 (or N=1), then for N=M+1.

It's even more important in audio DSP: processing near-zeroes can end up being extremely CPU intensive, look up denormal/subnormal floats.
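To make the subnormal point concrete, here is a small sketch in plain Python (3.9+ for `math.ulp`). It only shows where a decaying signal enters the subnormal range and how a flush-to-zero step would cut it off; it does not measure the slowdown, which depends on the CPU. The helper names `is_subnormal`, `flush_to_zero` and `decay_tail` are my own.

```python
import math
import sys

SMALLEST_NORMAL = sys.float_info.min   # 2**-1022 for IEEE 754 doubles
SMALLEST_SUBNORMAL = math.ulp(0.0)     # 2**-1074, the tiniest positive double


def is_subnormal(x: float) -> bool:
    """True for nonzero values below the smallest normal double."""
    return x != 0.0 and abs(x) < SMALLEST_NORMAL


def flush_to_zero(x: float) -> float:
    """Software stand-in for a hardware flush-to-zero mode."""
    return 0.0 if is_subnormal(x) else x


def decay_tail(start: float, factor: float, steps: int) -> list[float]:
    """Simulate a feedback tail (think of a reverb fading out)."""
    out = []
    x = start
    for _ in range(steps):
        x *= factor
        out.append(x)
    return out


tail = decay_tail(1.0, 0.5, 1100)
subnormal_steps = sum(1 for v in tail if is_subnormal(v))
flushed = [flush_to_zero(v) for v in tail]
```

Halving from 1.0, the tail spends a stretch of iterations in the subnormal range before finally reaching exact zero. That stretch is what a DSP loop either flushes away or pays for.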

Yeah, I studied mathematics (algebra and number theory) and zero is the point, often sporting discontinuities, or weird asymptotic behavior.

Quite a lot of algorithms use some form of division, and zero is the only number in our typical structures (Z, Q, R, C) that cannot be used as a divisor.

In machine integer arithmetic, one must also beware division by -1, which can convert MIN_INT into MIN_INT with a signed overflow and violate some arithmetic invariants, such as sign (negative divided by negative is _usually_ positive).
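A quick way to see the MIN_INT case without invoking real overflow: Python ints are unbounded, so the sketch below simulates 32-bit two's complement by hand. `wrap_int32` and `wrapping_div_int32` are names I made up. What actually happens on real hardware or in a given language varies (some trap, some leave it undefined), so this only models the "wrap around" outcome.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Reduce an unbounded Python int to 32-bit two's complement."""
    return (value + 2**31) % 2**32 - 2**31


def wrapping_div_int32(a: int, b: int) -> int:
    """Truncating (C-style) division, result wrapped to 32 bits."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    quotient = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        quotient = -quotient
    return wrap_int32(quotient)


# The true quotient is 2**31, which does not fit in 32 signed bits,
# so it wraps back to INT32_MIN: negative / negative stays negative.
surprising = wrapping_div_int32(INT32_MIN, -1)
```

Every other negative-by-negative division in range gives a positive result. This one input pair is the exception, which is why it deserves its own test case next to zero.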
Well, now in this brave new age of AI we can enjoy computer programs crashing with an

    Error: division by please upvote, share and like!
This also works; I upvoted your comment.
I have discovered a truly marvelous proof of how to smash that like and subscribe button, which this comment box is too small to contain.
Denormals are flushed to zero by default on most GPUs by the way.
Makes total sense, execution time is bounded. The point is it's still a case you must consider (what if near-zero is distinct from zero and significant?)
whisper MUST be combined with silence detection / VAD
Ah, the good old "you're holding it wrong".

What good is a speech recognition tool that literally hears imaginary voices?

Considering that if you DO use VAD (voice activity detection), it's the best open weights voice recognition model by a very wide margin, it's quite good. I'd be willing to bet that commercial products that "don't have this problem" are using VAD as well, and that this is well known to them. But Whisper is just the weights, and I suppose a simple reference implementation, not a full product.
> What good is a speech recognition tool that literally hears imaginary voices?

Well, if it is supposed to work after silence detection, then it is good for speech recognition I guess. It's like blaming a wheel for being circular because you can't sit on it. It's a part of a larger machine.

Just lay the wheel on its side and it makes a fine seat.
>imaginary voices

On the other hand, I can imagine that when things get quiet and the signal-to-noise ratio gets close to zero, random background audio (or randomness introduced in the transcription model) will be enough to tickle a critical number of neurons and elicit hallucinations.

The related thought exercise is this: Try scanning across the band with an AM or sideband radio, and after a while your brain will start to wonder "was that a voice I just heard, or music perhaps?" when in reality it was just environmental static.

Yes, you are holding it wrong. The good of it is that it does not output imaginary voices when used with VAD.

Show us a technology with better results that does not use VAD. If you can’t, then I’m not sure what you’re arguing against except superficialities so inconsequential that I can’t comprehend the condescension. The results speak for themselves.

faster-whisper has a min_silence_duration_ms option
There are much higher quality VAD solutions available
Please name a couple to get someone started who's hacking on webapps?

I'd really appreciate it.

So if a tool has a process to have it perform at its best then it's a problem?

Do you also moan that you have to prep a surface before applying glue or it won't stick? Or that you need to drill a guiding hole before making a larger one in wood? Or that you need to use truly prime numbers for a security key to actually be safe?

What's a good starter VAD lib, and if you know, the best implementation of something like this to use in a browser-based app?

Say if I wanted to use it for Voice Nav, or Voice Input, but not piss off random people speaking the wrong language.

If that's truly the case then they should make it part of the product, IMHO.
How is it not the case? It is unusable without VAD or editing. I don't understand what you're questioning

I agree their products could be better "end to end" integrated. Meanwhile there is a continuously-improving field of work for detecting speech (which Whisper is incapable of). They offer official "cookbooks" with guidance on an approach they recommend: https://cookbook.openai.com/examples/whisper_processing_guid...

> At times, files with long silences at the beginning can cause Whisper to transcribe the audio incorrectly. We'll use Pydub to detect and trim the silence.

(Official OpenAI quote)

What's VAD?
Voice Activity Detection (it predicts whether a short clip contains speech, eg to mute your microphone when you aren't speaking).
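To give a feel for the idea, here is a toy energy-based gate in plain Python. It is a deliberately crude stand-in: real VADs are usually trained models and handle noise far better, and a loudness threshold like this will happily pass music or a door slam. The function names and the `threshold` and `min_silence_ms` parameters are my own, not any library's API.

```python
import math


def frame_rms(samples) -> float:
    """Root-mean-square level of one frame of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def speech_segments(samples, sample_rate, frame_ms=30,
                    threshold=0.02, min_silence_ms=300):
    """Return (start_s, end_s) spans whose frames are louder than threshold.

    Quiet gaps of up to min_silence_ms are bridged, so a short pause
    between words does not split a segment.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    max_gap_frames = min_silence_ms // frame_ms
    spans = []
    start = None
    last_active = None
    for offset in range(0, len(samples), frame_len):
        idx = offset // frame_len
        active = frame_rms(samples[offset:offset + frame_len]) >= threshold
        if active:
            if start is None:
                start = idx
            last_active = idx
        elif start is not None and idx - last_active > max_gap_frames:
            spans.append((start, last_active + 1))
            start = None
    if start is not None:
        spans.append((start, last_active + 1))
    seconds_per_frame = frame_len / sample_rate
    return [(a * seconds_per_frame, b * seconds_per_frame) for a, b in spans]


def tone(seconds, sample_rate, freq=220.0, amplitude=0.5):
    """Synthetic stand-in for 'speech' so the sketch needs no audio file."""
    n = int(seconds * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq * i / sample_rate)
            for i in range(n)]


RATE = 16000
silence = [0.0] * RATE
audio = silence + tone(1.0, RATE) + silence
segments = speech_segments(audio, RATE)
```

The point for transcription is that only the returned spans would be handed to the recognizer, so the model never sees the silent stretches it tends to fill with "Thanks for watching!".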
Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?
I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
I can. He was asking if Babbage was cheating.

You put in 2+2 - the right figures. The machine says 4 - the right answer. If you put in the wrong figures, like 3+3, will the machine still say 4? It's easy to make a machine that always says 4.

The people who asked him that question, however, probably got a different scam demonstrated to them every every. Remember the Mechanical Turk? Babbage's reply paints him very honestly. It shows that he couldn't even conceive that someone might try to trick the royal court (or whoever it was) into accepting a fake device.

Having had zero exposure to any form of computation for your entire life, as was the case for the vast majority of people in the early 19th century.
What's the defence for the current population?
When YouTube began building automatic transcriptions for captions, it regularly flagged any noise or music -- typically industrial noise -- with "[foreign]"

If it couldn't understand it, it was "foreign" for the longest time.

Hey, Netflix occasionally still puts in its English subtitles "[foreign music]", it always cracks me up.
[speaks japanese]

To be fair, there is a difference between when subtitles match the source language and when they don't. The former are often verbatim.

Haha, yes, it's fair when English subtitles write something like [speaks Japanese], especially when at least one of the characters is not supposed to understand what's being said (when they do, it's more appropriate to write "[in Japanese]: let's go shopping!").

Netflix sometimes takes the cake with what I consider the most outrageous option: writing "[in English]" when they mean "in whatever language the protagonist considers native", which is mind-bogglingly wrong and hilarious at the same time.

They do this with the English subtitles of the German production "Die Kaiserin" ("The Empress"): whenever Sisi is speaking in another language, say French, the subtitles will say "[in French] I love you...", and when she switches back to German they will say "[in English] I love you...". WTF, Netflix? Note this is unrelated to understanding German; it's mostly Netflix looking down on its customers and assuming they cannot comprehend there are people in the world for whom their native tongue is different to the viewer's native tongue.

This has happened in more shows, enough to know it's not a fluke, though Netflix is inconsistent about it.

[laughs in Japanese]
Yeah, I can confirm seeing that a fair bit specifically during non-verbal parts of videos when someone is using a tool.
Can confirm as well, although to my recollection it just shows up as if it's a word the transcription model heard, not "[foreign]" in brackets like with "[Music]" or "[Applause]". It's especially weird to me because I recall the auto-transcriptions being reasonably serviceable when they first rolled them out, only to degrade over time to the point where it was hallucinating the word "foreign" and dropping letters from words or using weird abbreviations (like "koby" for "kilobyte", "TBTE" for "terabyte", or, most memorably weirdly, transcribing the phrase "nanosecond-by-nanosecond" as "nond by nanc") if it didn't decide it heard another one entirely.

I also noticed a couple of months ago that YouTube seems to have quietly rolled out a new auto-transcription model that can make reasonable guesses at where capitalization, punctuation, and sentence boundaries should go. It seems to have degraded even more rapidly than the old one, falling victim to the same kinds of transcription errors. Although the new one has a different hallucination in silence and noise that it wasn't able to classify (which, incidentally, its ability to recognize things like music and applause seems worse than the old one's): where the old model would have hallucinated the word "foreign", the new one thinks it's hearing the word "heat", often repeated ("Heat. Heat.").

That's interesting, the few times I tried playing with whisper, I had the impression that YouTube style videos or random cellphone videos were something it did particularly badly with (compared to movies). My guess at the time was that most of the training material might be subtitles and raw screenplays.

The videos I tried to transcribe were also Mandarin Chinese, using whisper-large-v3. Besides the usual complaints that it would phonetically "mishear" things and generate nonsense, it was still surprisingly good, compared to other software I played around with.

That said, it would often invent names for the speakers and prefix their lines, or randomly switch between simplified and traditional Chinese. For the videos I tested, intermittent silence would often result in repeating the last line several times, or occasionally, it would insert direction cues (in English for some reason). I've never seen credits or anything like that.

In one video I transcribed, somebody had a cold and was sniffling. Whisper decided the person was crying (transcribed as "* crying *", a cough was turned into "* door closing *"). It then transcribed the next line as something quite unfriendly. It didn't do that anymore after I cut the sniffling out (but then the output switched back to traditional Chinese again).

Similar in the English model. Pretty clear they trained on YouTube videos where creators will put that in otherwise silent sections to ensure it shows up for people with CC on.
The number one hallucination in my transcriptions was "Subtitles by the Amara.org community".
> I suspect they trained the model on some random YouTube video without carefully picking really useful data.

They trained the model on every YouTube video they could, and hoped the aggregate was useful data.

This reminds me, some years ago as Google was expanding its translation service, someone tried translating text into and out of an obscure African language (don't recall which) and it always came out as weird Biblical-sounding semi-gibberish.

My revelation was that machine translation needs a corpus of bilingual documents to learn from, and if the language is sufficiently obscure, there may not be any bilingual documents except for the Bible, which missionaries have translated into just about every language on Earth.

This is totally happening with other models too, at least with Spanish. Many transcriptions will end with something that roughly translates to "Thanks for watching!" even if it's never present in the original audio.
oh yeah this happens a lot on reddit on videos in foreign languages
lmao