by araes 858 days ago

The only argument I'm making is that if 1,000,000 developers all want to train LLMs on video data, because they desperately need to beat Sora, or ChatGPT, or Stable Diffusion, then there are probably a lot of them rolling their own scraping software.

Probably rolling their own scraping software with inefficient methods, and then likely pseudo-DDoSing (mostly irritating) Google with constant scrape attempts.

I could fight forever against petabytes of constant downloads, or simply make an incredibly small, condensed, easy-to-download summary that minimizes my bandwidth cost and reduces each download to bytes or kilobytes rather than hundreds of MB.

At 1,250 Kbps, 480p (roughly Google's recommendation), every user streaming for an hour works out to approximately 550 MB/hr of data. If the situation gets really bad, and 50% is scrapers (like crawling has gotten to be 50% of the web), and maybe 50% of those can be reduced by a factor of 100 because all they want is the text, then that quarter of the traffic, roughly 140 MB of every 550 MB, can be reduced to about 1.4 MB. Close to 1/4 of the bandwidth removed.
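For anyone who wants to check the back-of-envelope math, here is a rough sketch in Python. The 50% scraper share, the 50% text-only share, and the 100x reduction are the guesses from this comment, not measured values, and the function names are made up for illustration. The exact bitrate conversion comes out slightly above the 550 MB/hr quoted, at about 562.5 MB/hr.

```python
# Back-of-envelope check; every constant is an assumption from the comment.
BITRATE_KBPS = 1_250      # 480p bitrate, in kilobits per second
SCRAPER_SHARE = 0.5       # guessed share of traffic that is scrapers
TEXT_ONLY_SHARE = 0.5     # guessed share of scrapers that only want text
REDUCTION_FACTOR = 100    # guessed shrink factor from serving a summary


def mb_per_hour(bitrate_kbps: float) -> float:
    """Convert kilobits per second to megabytes per hour (decimal units)."""
    seconds_per_hour = 3600
    kilobits = bitrate_kbps * seconds_per_hour
    return kilobits / 8 / 1000  # kilobits -> kilobytes -> megabytes


def fraction_saved(scraper_share: float,
                   text_only_share: float,
                   reduction_factor: float) -> float:
    """Fraction of total bandwidth removed if text-only scrapers get a summary."""
    reducible = scraper_share * text_only_share
    return reducible * (1 - 1 / reduction_factor)


if __name__ == "__main__":
    per_hour = mb_per_hour(BITRATE_KBPS)
    reducible_mb = per_hour * SCRAPER_SHARE * TEXT_ONLY_SHARE
    saved = fraction_saved(SCRAPER_SHARE, TEXT_ONLY_SHARE, REDUCTION_FACTOR)
    print(f"stream size:      {per_hour:.1f} MB/hr")
    print(f"reducible part:   {reducible_mb:.1f} MB/hr")
    print(f"after reduction:  {reducible_mb / REDUCTION_FACTOR:.2f} MB/hr")
    print(f"bandwidth saved:  {saved:.1%}")
```

Change any of the three guesses and the saved fraction moves with them, so treat the result as an order-of-magnitude estimate only.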

There may also be a lot that effectively "are" search crawlers, and all they really want is a summary for categorization of videos and better search indexing. Except they download the video, because everybody's rolling their own solutions, and huge portions of StackOverflow and similar amount to "use this code, it's invincible." And the people deploying them don't even know what they're doing because it's all copy-pasta.

Admittedly, it runs into issues where they then simply download 100x as many videos. However, reasonable limits on video streams per second, API calls per unit time, and calls per IP address block per unit time could mostly mitigate those issues.
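As a minimal sketch of the per-IP-block limit, here is a token bucket keyed by IPv4 /24 block. This is a generic technique, not anything Google is known to run. The class name, the /24 grouping, and the rate and capacity values are all hypothetical choices for illustration.

```python
import ipaddress
import time
from typing import Callable, Dict, Tuple


class BlockRateLimiter:
    """Token-bucket limiter shared by every address in an IPv4 block.

    Hypothetical policy: each block gets `capacity` calls of burst and
    refills at `rate` calls per second.
    """

    def __init__(self, rate: float, capacity: float, prefix_len: int = 24,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.prefix_len = prefix_len
        self.clock = clock  # injectable so the limiter can be tested
        # block -> (tokens remaining, time of last update)
        self.buckets: Dict[str, Tuple[float, float]] = {}

    def _block(self, ip: str) -> str:
        # strict=False lets a host address stand in for its network.
        net = ipaddress.ip_network(f"{ip}/{self.prefix_len}", strict=False)
        return str(net)

    def allow(self, ip: str) -> bool:
        """Return True and spend a token if the caller's block has one left."""
        now = self.clock()
        key = self._block(ip)
        tokens, updated = self.buckets.get(key, (self.capacity, now))
        # Refill for the elapsed time, capped at the burst capacity.
        tokens = min(self.capacity, tokens + (now - updated) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False
```

Something like `BlockRateLimiter(rate=1.0, capacity=60).allow("203.0.113.7")` would then gate each call. A real deployment would also need IPv6 handling, eviction of idle buckets, and shared state across servers, none of which is covered here.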

I appreciate you see the irony in the issue, and their cultural opposition is partially what I'm pointing out. Constantly fighting against a deluge when you could just divert the river.