Hacker News
by Shawnj2 811 days ago
The fact that ChatGPT isn’t deemed copyright infringement is absurd. Like, you can’t take the entire internet, use it to train your software, and claim you’re not violating the copyright of thousands of people.
4 comments

If the predictions that traditional search engines will be displaced by LLM engines turn out to be correct then there will have to be a reckoning about copyright. It's already difficult enough to make money by writing online, but if most content gets consumed second-hand through an LLM then it will become basically impossible. How are journalists supposed to eat if NewsGPT just scoops up their work and starts regurgitating it seconds after publishing?
> How are journalists supposed to eat if NewsGPT just scoops up their work and starts regurgitating it seconds after publishing?

NewsGPT won't just regurgitate the work of journalists. First it'll consider the paid "partners" of NewsGPT to make sure to downplay anything that might hurt them, then it'll do the same for their advertisers while inserting some ads in the text, then they'll give the article tweaks according to NewsGPT's own ideology and then finally spit out something very different at their users. Maybe they can argue that NewsGPT is too transformative to count as copyright infringement.

How are you supposed to trust "journalism" from a text generator that hallucinates? The information ecosystem is bad enough without running it through a text blender that's already hitting compute, power and data limits.
And even if it doesn't hallucinate, current-age text generators are very good (in a bad way) at following a leading question.

For example, a question like "Tell me why I should use semaglutide for weight loss" gives a widely different answer than "Tell me why I shouldn't use semaglutide for weight loss".

A human writer might fall into the bias trap of the original question being leading, but much less so than text generators that often repeat your prompt (reinforcing whatever leading answer was embedded in your question) before answering it.

Regurgitation seconds after is what already happens with the AP though. There are some real journalists that will sadly be pushed further out of the fold, and presumably many human but fake journalists that have been coasting for years on such regurgitation. I’m not so optimistic about the AI future, and I believe payment or at least credit really needs to get figured out for generative stuff. But real content producers should direct some of the irritation at their editors, colleagues, and industry, or else it’s all rather dishonest, isn’t it?
> Regurgitation seconds after is what already happens with the AP though.

The AP makes about half a billion a year from other outlets paying them for permission to regurgitate their content. That's not the same as the AI lobby saying they should be allowed to scrape apnews.com and publish articles derived from the content they get from there, for free, and without attribution.

I do see your point, and yeah theft is theft and theft is bad. But what I’m getting at is more the POV for media consumers.

If it’s regurgitated / unoriginal anyway then I don’t think most people care much whether it’s summarized/subjected to extra spin and fluff by a person or by a machine.

We should be working to strengthen journalism as a practice. LLMs will do anything but.
Journalists are ultimately extremely overrated in 2024.

I go out of my way to not consume news outside what happens to cross my way because of financial markets.

What exactly do you think I am missing that is so important? Journalists by and large produce complete nonsense in 2024. Journalists in 2024 are a massive net negative and would be much better off doing something productive, like selling apples on the street.

The main counterargument is that you have read 1000s of documents to train your brain which produces unique documents with no credit to the original copyright holders.

GenAI is just doing the same thing on a larger scale.

Well, if the concept is "we should legally treat the systems just like very big humans", then the next step is to arrest and confine all the leaders of the companies involved on charges of slavery and child exploitation.

The distinction does matter in copyright too, since a transformative work needs some non-trivial amount of human input.

Yep, the issue with the parent counterargument is that gen AI is a monetized tool owned and sold by a corporation. People would probably be fine with a human-like embodied AI or something learning in the same way.
If I offered a paid service where you could pay me $20 a month and I would draw you copyrighted works that are in my internal neural network, that would also be illegal.
Frankly, it is and should be treated as such. The fact that they're dodging questions about their data sources is a red flag and a pretty clear indication that they know they're in the wrong and are fighting to become established enough to be in a position to, at best, ask for forgiveness after the fact.
Isn't that what Google did? They scraped the internet, but the public/econ advisors felt the benefits outweighed the copyright violations. They were just "indexers"; they weren't scraping "news", they were indexing it lol

Same thing with emulators and ROMs. Somebody dumped the cartridges (copyrighted software) into ROM files to be played on emulators (copyrighted BIOS), but they were "archiving", and if you owned the original copy you could download them. I still vividly remember seeing this disclaimer on a warez website: "DMCA SAFE HARBOUR NOTICE: YOU MUST OWN THE ORIGINAL GAME OTHERWISE ITS ILLEGAL BUT YES, YOU CAN DOWNLOAD EVERY SINGLE GAME MADE ON THAT CONSOLE FOR FREE"

I feel like the outcome will be the same for LLMs trained on copyrighted material. It will be "training". The net benefit is too great to fret over "training".

tldr: "indexing" ---> "archiving" ---> "training"

Google surfaces data — or it used to — LLMs and AI companies actively exploit it with zero benefit given to creators or users of the platforms they're now cannibalizing.
The irony. I'm surprised that businesses built on selling Google search results are allowed to exist. I guess it's for the same reason that Google scraping the internet and building a product on top of it is allowed.

Then it only makes sense that scraped AI training data is also going to be tolerated, because to prove infringement you would need to show, through forensic analysis, that a large language model like ChatGPT trained on your copyrighted content can produce a similar derivative of that content.

It's such an uphill battle for copyright holders. They need to replicate: copyrighted input ---> LM similar to ChatGPT4 ---> copyrighted output

So far it's not looking good for OpenAI because it's possible to generate copyrighted output (type spiderman in Czech), so all that remains is demonstrating the middle layer (training an LM similar to ChatGPT4 on it), but that is unrealistically expensive.

I have a theory that all this money spent on large models is to make discovery impossible (as it would require access to $100 billion in GPUs)

The whole notion that AI can replace search is nonsense. It yields no benefit to the creators of the results it scrapes and the models hallucinate. It's worse for users and it's worse for everyone producing anything of note online.
But many ChatGPT users are not using Google as much, instead relying on LLMs + RAG.

ChatGPT is the new search engine and provides far more value to the end user than Google.

The issue seems to be that people want a payout from OpenAI... but it's a non-profit

It's a shiny toy — it'll yield worse answers. Much like Google's own AI.
The same benefit doesn’t exist for ChatGPT as for Google, because Google means people click on your site and you get ad revenue. Google even facilitates this in both directions, with search ads and with an ad service that pays you for hosting ads. The ROM site DMCA thing was always BS lmao. It’s completely legal for you to dump your own carts and use them in emulators, but that freedom doesn’t extend to having a copy of someone else’s game cart. That’s just an intentional misunderstanding of the DMCA in a futile attempt to not get banned.
So you think scraping copyrighted content to sell ads is okay, and downloading copyrighted games for free is also okay. Then why is it not okay for ChatGPT to train itself on scraped content?
It's not scraping, it's indexing and linking out to creators. LLMs are helping themselves to everything with no regard for content creators. They should be subject to copyright claims — I don't care if it destroys their business, they should've considered that at the outset. They didn't then and they don't care to now, they're simply greedy and looking to build something that benefits themselves and their investors with no regard for anyone they step on to do so.
But how can you prove that your picture of a cat was used in an LLM?

If you owned a franchise called "Chicken Brothers" with a logo of two chickens standing side by side with arms crossed proudly, do you have a claim over all derivatives, including the Spanish name generated by an LLM?

I just don't think it's straightforward. The main complaint should be about a payout for licenses used during training, but it's tough to prove unless someone at OpenAI dumps the AWS CloudWatch logs

That's OpenAI's problem and the burden should be on them.
The first part is fine because the search engine blurb isn’t a replacement for the thing itself. And I disagree with what ROM sites claim: you can’t just dump ROMs online and claim it’s not copyright infringement.