by araes 859 days ago
Checking online, this [1] appears to be one of the most heavily referenced libraries on Stack Overflow for downloading both user-entered and automatically generated transcripts (Python-based).

[1] https://github.com/jdepoix/youtube-transcript-api

Notably, Google really needs to have an obvious API endpoint for this kind of call. If thousands of programmers are all rolling their own implementation, there's probably a huge number who constantly download the full video and transcribe it themselves for data harvesting.

Honestly, I'm kind of surprised it's taken this long for YouTube to fall prey to massive data harvesting campaigns. From this article [2] and this paper on YouTube data statistics [3], there are ~14,000,000,000 videos on YouTube with a mean length of 615 seconds (~10 minutes).

You'd think people would be interested in:

  8,610,000,000,000 seconds
  143,500,000,000 minutes
  2,391,666,666 hours
  3,274,083 months
  272,840 years
  27,284 decades
  2,728 centuries
  273 millennia
Of live action video on nearly every single subject in human existence.
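
As a rough sanity check on the figures above, here is a minimal Python sketch of the conversion. The video count and mean length are the ones quoted from [2] and [3]; the 365.2425-day year is my own assumption and is not stated in the comment.

```python
# Figures quoted in the comment above.
VIDEOS = 14_000_000_000   # ~14 billion videos
MEAN_SECONDS = 615        # mean video length in seconds

total_seconds = VIDEOS * MEAN_SECONDS
total_minutes = total_seconds / 60
total_hours = total_minutes / 60

# Assumption: a Gregorian mean year of 365.2425 days.
DAYS_PER_YEAR = 365.2425
total_years = total_hours / 24 / DAYS_PER_YEAR

print(f"{total_seconds:,} seconds")
print(f"{total_minutes:,.0f} minutes")
print(f"{total_hours:,.0f} hours")
print(f"{total_years:,.0f} years")
```

The month figure depends on the month length you pick, so the sketch stops at years; decades, centuries, and millennia follow by dividing by 10 each time.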

Also, the paper's really cool, and extremely sobering about being a "content creator", given how the top 1% get all the views.

[2] "What We Discovered on ‘Deep YouTube’", https://www.theatlantic.com/technology/archive/2024/01/how-m...

[3] "Dialing for Videos: A Random Sample of YouTube", https://journalqd.org/article/view/4066/3766

2 comments

My understanding is that YouTube actively _undermines_ the ability of tools like youtube-dl to download videos. I see the irony that providing an API endpoint (just for transcripts) would maybe save them on egress costs.

But I think they are probably culturally opposed to publicly exposing this sort of thing, even if it only works via an authenticated account. It's also worth considering that doing so would make it easier for a competitor to steal the value they provide with the generated closed captions.

The only argument I'm making is that if 1,000,000 developers all want to train LLMs on video data, because they desperately need to beat Sora, or ChatGPT, or Stable Diffusion, then a lot of them are probably rolling their own scraping software.

Probably rolling their own scraping software with inefficient methods. And then likely pseudo-DDoSing (mostly irritating) Google with constant scrape attempts.

I could fight forever against petabytes of constant downloads, or simply make an incredibly small, condensed, easy-to-download summary that minimizes my bandwidth cost and reduces each download to bytes or kilobytes rather than hundreds of MB.

At 1,250 kbps, 480p (~Google rec), every user streaming for an hour is approximated at 550 MB / hr of data. If the situation gets real bad, and 50% is scrapers (like crawling has gotten to be 50% of the web), and maybe 50% of those can be reduced by a factor of 100, because all they want is the text, then maybe 150 MB of that 550 MB can be reduced to 1.5 MB. Close to 1/4 of the bandwidth removed.
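
The back-of-envelope math above can be sketched in a few lines of Python. The bitrate and the 50% / 50% / 100x figures are the guesses from the comment, not measurements, and the variable names are purely illustrative.

```python
# Inputs from the comment above; all of them are rough guesses.
BITRATE_KBPS = 1_250      # 480p stream bitrate
SECONDS_PER_HOUR = 3_600
SCRAPER_SHARE = 0.5       # hypothetical share of traffic that is scrapers
TEXT_ONLY_SHARE = 0.5     # hypothetical share of scrapers that only want text
REDUCTION_FACTOR = 100    # transcript vs. full video, as guessed above

# kilobits -> megabytes (decimal units): /8 for kilobytes, /1000 for MB.
mb_per_hour = BITRATE_KBPS * SECONDS_PER_HOUR / 8 / 1_000

# Portion of each hour's traffic that could be served as text instead.
reducible_mb = mb_per_hour * SCRAPER_SHARE * TEXT_ONLY_SHARE
after_mb = reducible_mb / REDUCTION_FACTOR
saved_fraction = (reducible_mb - after_mb) / mb_per_hour

print(f"{mb_per_hour:.1f} MB/hr per stream")
print(f"{reducible_mb:.1f} MB reducible to {after_mb:.2f} MB")
print(f"{saved_fraction:.1%} of bandwidth removed")
```

The exact conversion comes out slightly above the rounded 550 MB and slightly below the rounded 150 MB, but the conclusion of roughly a quarter of bandwidth saved is the same.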

There may also be a lot that effectively "are" search crawlers, and all they really want is a summary for categorization of videos and better search indexing. Except they download the video, because everybody's rolling their own solutions, and huge portions of Stack Overflow and similar amount to "use this code, it's invincible." And the people deploying them don't even know what they're doing because it's all copy-pasta.

Admittedly, it runs into issues where they then simply download 100x as many videos. However, reasonable limits on video streams per second, API calls per unit time, and calls per IP address block per unit time could mostly mitigate those issues.
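
To make the rate-limit idea concrete, here is a minimal token-bucket sketch keyed by IP address block. This is only an illustration of the general technique; it says nothing about how YouTube actually throttles, and the class names, rates, and example address blocks are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate: float      # tokens refilled per second
    capacity: float  # maximum burst size
    tokens: float    # tokens currently available
    updated: float   # timestamp of the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerBlockLimiter:
    """One bucket per IP address block (hypothetical keying scheme)."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, ip_block: str, now: float) -> bool:
        bucket = self.buckets.get(ip_block)
        if bucket is None:
            # New blocks start with a full bucket.
            bucket = TokenBucket(self.rate, self.capacity, self.capacity, now)
            self.buckets[ip_block] = bucket
        return bucket.allow(now)


# Example: 1 call per second with a burst of 2, timestamps passed in by hand.
limiter = PerBlockLimiter(rate=1.0, capacity=2.0)
results = [limiter.allow("203.0.113.0/24", t) for t in (0.0, 0.0, 0.0, 1.0)]
print(results)
```

Passing the timestamp in explicitly keeps the sketch deterministic; a real service would read a monotonic clock and evict idle buckets.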

I appreciate you see the irony in the issue, and their cultural opposition is partially what I'm pointing out. Constantly fighting against a deluge when you could just divert the river.

What competitor?
Did you consider that the reason they don't have many competitors is because of that sort of behavior?

To answer more directly: "a hypothetical one". Also, I'm speculating and may be wrong.

Why would YT want to give away all this excellent training data for LLMs/AI? My guess is that not doing this makes it expensive for those wanting to slurp up data.