Hacker News
by spxneo 811 days ago
Isn't that what Google did? They scraped the internet, but the public and economic advisors felt the benefits outweighed the copyright violations. They were just "indexers"; they weren't scraping "news", they were indexing it lol

Same thing with emulators and ROMs. Somebody dumped the cartridges (copyrighted software) into ROM files to be played on emulators (copyrighted BIOS), but they were "archiving", and if you owned the original copy you could download them. I still vividly remember seeing this disclaimer on a warez website: "DMCA SAFE HARBOUR NOTICE: YOU MUST OWN THE ORIGINAL GAME OTHERWISE ITS ILLEGAL BUT YES, YOU CAN DOWNLOAD EVERY SINGLE GAME MADE ON THAT CONSOLE FOR FREE"

I feel like the outcome will be the same for LLMs trained on copyrighted material. It will just be "training". The net benefit is too great to fret over "training".

tldr: "indexing" ---> "archiving" ---> "training"

2 comments

Google surfaces data — or it used to — LLMs and AI companies actively exploit it with zero benefit given to creators or users of the platforms they're now cannibalizing.
The irony. I'm surprised that businesses built on selling Google search results are allowed to exist. I guess it's for the same reason that Google scraping the internet and building a product on top of it is allowed.

Then it only makes sense that scraped AI training data is also going to be tolerated, because to prove infringement you would need forensic analysis showing that a large language model like ChatGPT, trained on your copyrighted content, can produce a similar derivative of that content.

It's such an uphill battle for copyright holders. They need to replicate: copyrighted input ---> LM similar to ChatGPT4 ---> copyrighted output

So far it's not looking good for OpenAI, because it's possible to generate copyrighted output (type "spiderman" in Czech). All that remains is demonstrating the middle layer (training an LM similar to ChatGPT4 on the copyrighted input), but that is unrealistically expensive.

I have a theory that all this money spent on large models is meant to make discovery impossible (as it would require access to $100 billion worth of GPUs).

The whole notion that AI can replace search is nonsense. It yields no benefit to the creators of the results it scrapes and the models hallucinate. It's worse for users and it's worse for everyone producing anything of note online.
But many ChatGPT users are not using Google as much, relying instead on LLMs + RAG.

ChatGPT is the new search engine and provides far more value to the end user than Google.

The issue seems to be that people want a payout from OpenAI... but it's a non-profit.

It's a shiny toy — it'll yield worse answers. Much like Google's own AI.
Google search is terrible. ChatGPT is definitely better for searching right now, and I often find myself reaching for it over Google for a wide category of questions.
ChatGPT doesn’t offer the same benefit as Google, because with Google people click through to your site and you get ad revenue. Google even facilitates this in both directions, with search ads and with an ad service that pays you for hosting ads. The ROM site DMCA thing was always BS lmao. It’s completely legal for you to dump your own carts and use them in emulators, but that freedom doesn’t extend to having a copy of someone else’s game cart. That’s just an intentional misunderstanding of the DMCA in a futile attempt to not get banned.
So you think scraping copyrighted content to sell ads is okay, and downloading copyrighted games for free is also okay. Then why is it not okay for ChatGPT to train itself on scraped content?
It's not scraping, it's indexing and linking out to creators. LLMs are helping themselves to everything with no regard for content creators. They should be subject to copyright claims — I don't care if it destroys their business, they should've considered that at the outset. They didn't then and they don't care to now, they're simply greedy and looking to build something that benefits themselves and their investors with no regard for anyone they step on to do so.
But how can you prove that your picture of a cat was used in an LLM?

If you owned a franchise called "Chicken Brothers" with a logo of two chickens standing side by side with arms crossed proudly, do you have a claim over all derivatives, including the Spanish name generated by an LLM?

I just don't think it's straightforward. The main complaint should be about payment for licensed material used during training, but that's tough to prove unless someone at OpenAI dumps the AWS CloudWatch logs.

That's OpenAI's problem and the burden should be on them.
The first part is fine because the search engine blurb isn’t a replacement for the thing itself. And I disagree with what ROM sites claim; you can’t just dump ROMs online and claim it’s not copyright infringement.