by martin-t 292 days ago
1) They absolutely do sometimes repeat training data verbatim.[0]

2) That's not even the point. The point is being trained on stolen data without permission, then pretending that the resulting model of the training data is not a derived work of the training data and that the output of the model plus a prompt is not a derived work of the training data.

Point 1 is just an extreme edge case which is a symptom of point 2 and yet people still have trouble accepting it.

The GPL was about user freedom. If the concept of a derived work no longer applies once you run code through a sufficiently complex plagiarism automator, then plagiarism is unprovable and the GPL is broken. Great, we lost another freedom.

[0]: I recall a study or court document with 100 examples of plagiarising multiple whole paragraphs from the New York Times, don't have time to look for it now


> I recall a study or court document with 100 examples of plagiarising multiple whole paragraphs from the New York Times, don't have time to look for it now

Convenient. Well then, I recall two studies that said the opposite. Unfortunately pressed for time as well.

https://en.lmgtfy2.com/query/?q=ONE+HUNDRED+EXAMPLES+OF+GPT-...

You didn't have to be rudely dismissive and lie, you chose to.

I would happily respond politely to a polite request.

Please be mindful of your behavior next time.

---

Link for everyone else: https://nytco-assets.nytimes.com/2023/12/Lawsuit-Document-dk...

Not very convincing. If you prompt GPT-4 (nobody uses it) with a huge chunk of an article (nobody does this), sometimes it'll output another chunk of said article. Conveniently omitted: how many attempts did not result in this behavior, and how much of the articles was not repeated (you can see they cut off mid-answer).

> trained on stolen data without permission

My sympathies to academic publishers ;)

This all seems totally orthogonal to the statement: "I don't get why people are so resistant to the idea that AI can prove new mathematical theorems."

I don't necessarily disagree about the copyleft stuff.

Transformers do sometimes overfit to exact token sequences from training data, but that isn't really what the architecture does in general.

When you say new mathematical theorems, they absolutely can. So can infinite monkeys on typewriters, though LLMs have a much better heuristic for arriving at valid theorems.

The same applies to valid new programs.

The issue I have with this is pretending that the word "new" is sufficient justification for giving all the credit/attribution and subsequent reward (reputational, financial, etc.) to the person who wrote the prompt instead of distributing it to the people in the whole chain of work according to how much work and what quality of work they did.

How many man-hours did it take to create the training data? How many to create the LLM training algorithm and the electricity to run it? How many to write the prompts?

The most work by many, many orders of magnitude was put in by the first group. They often did it with altruistic goals in mind and released their work under permissive or copyleft licenses.

And now somebody found a way to monetize this effort without giving them anything in return. In fact, they will have to pay to access the LLMs which are based on their own work.

Copyright and plagiarism are perhaps the wrong terms to use when talking about it. I think copyright should absolutely apply, but it was designed in the first place to protect creative works, not code.

Either way it's a form of industrialized exploitation and we should use all available tools to defend against it.

You're completely correct in your two points; however, people _do_ regularly assert that LLMs cannot possibly generate anything novel: "they are just regurgitating and recombining the original".

I mean, sure. But so am I (in what is likely a far more advanced manner, but still). I find it somewhat funny that I am also partially trained on stolen data without permission. I also jaywalk occasionally (perhaps I am trivializing the topic too much, but show me a researcher who hasn't _once_ downloaded a paper they really needed in less than perfectly legal ways).

Human time is valuable, LLM time is not. If you spend hundreds of hours creating something, nobody should have the right to copy it (verbatim or with automatic modifications) unless you allow them.

Human rights are valuable. LLMs allow laundering GPL code (removing both attribution and users' rights to inspect and modify the code). Free software cannot compete against proprietary software in a world where making a copy is trivial but proving it's a copy is nearly impossible.