Hacker News
by fxnn 1034 days ago
For me, that discussion is always hard to grasp. When a human teaches themselves to code by reading source code and later writes new code, they can only do so because they read licensed code. No one would ask for the license, right?

So why do we care from where LLMs learn?

> So why do we care from where LLMs learn?

Because humans aren't computers and the similarities between the two, other than the overuse of the word "learning" in the computer's case, are nonexistent?

Are you really asserting that these models aren't learning? What definition of learning are you using?

Don't know if they are, and I don't really care either. I especially don't care to anthropomorphize circuitry to the extent that AI proponents tend to.

Humans and computers are two wholly separate entities, and there's zero reason for us to conflate the two. I don't care if another human looks at my code and straight up copies and pastes it. I care very much, however, if an entity backed by a megacorp like Micro$oft does the same en masse and sells it for profit.

Okay, so the scale at which they sell their service is a good argument that this is different from a human learning.

However, on the other hand, we also have the scale at which they learn, which kind of makes every individual line of source code they learn from pretty unimportant. Learning at this scale is a statistical process, and in most cases individual source snippets diminish in the aggregation of millions of others.

Or, to put it the other way round, the actual value lies in the effort of collecting the samples, training the models, creating the software required for the whole process, putting everything into a good product and selling it. Again, in my mind, the importance of every individual source repo is too small at this scale to care about its license.

The idea that individual source snippets at this scale diminish in aggregation is undercut by the fact that OpenAI and MSFT are both selling enterprise-flavoured versions of GPT, and the one thing they promise is that enterprise data will not be used to further train GPT.

That is a fear for companies because the individual source snippets and the knowledge "learned" from them are seen as a competitive advantage of which the sources are an integral part, and I think this is a fair point from their side. However, the exact same argument should then apply in favour of paying the artists, writers, coders, etc. whose work has been used to train these models.

So it sounds like they are trying to have their cake and eat it too.

Hmm. You sure this is the same thing? I would say it’s more about confidentiality than about value.

Because what companies want to hide are usually secrets that are available to (nearly) no one outside of the company. It’s about preventing accidental disclosure.

What AIs are trained on, on the other hand, is publicly available data.

To be clear: what could leak accidentally would have value, of course. But here it’s about the single important fact that becomes public although it shouldn’t, vs. the billions of pieces from which the trained AI emerges.

It's really not different in scale. Imagine for a moment how much storage space it would take to store the sensory data that any two year old has experienced. That would absolutely dwarf the text-based world the largest of LLMs have experienced.

If you don't care, why are you confidently asserting things you're not even interested in examining? It just drowns out useful comments.

Do humans really read terabytes of C code to learn C?

Humans look at a few examples and extrapolate…

But that also exists in the AI world. It’s called "fine-tuning": an LLM trained on a big general dataset can learn special knowledge with little effort.

I’d guess it’s exactly the same with humans: a human who received a good general education can quickly learn specific things like C.

Humans have experienced an amount of data that absolutely dwarfs the amount of data even the largest of LLMs have seen. And they've got billions of years of evolution to build on to boot.

You're straying away. Let's talk about learning C.

Also, humans didn't evolve over billions of years.

The process of evolution "from scratch", i.e. from single-celled organisms, took billions of years.

This is all relevant because humans aren't born as random chemical soup. We come with pre-trained weights from billions of years of evolution, and fine-tune that with enormous amounts of sensory data for years. Only after that incredibly complex and time-consuming process does a person have the ability to learn from a few examples.

An LLM can generalize from a few examples in a new language that you invent yourself and that isn't in the training set. Go ahead and try it.

There is the element of the unknown with LLMs etc.

There is a legal difference between learning from something and truly making your own version, on the one hand, and simply copying, on the other.

It's vague of course - take plagiarism in a university science essay - the student has no original data and very likely no original thought - but still there is a difference between simply copying a textbook and writing it in your own words.

Bottom line - how do we know the output of the LLM isn't a verbatim copy of something with the license stripped off?
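
One rough way to probe that question, as a minimal sketch only: compare the generated text against a source you already suspect, and look for long runs of identical tokens. The function names, the whitespace tokenization and the 8-token threshold are all assumptions made up for illustration; none of them come from this thread or from any real tool. It only catches exact copies of a source you have at hand, so it says nothing about paraphrased or unknown sources.

```python
from difflib import SequenceMatcher
from typing import List


def longest_verbatim_run(generated: str, known_source: str) -> List[str]:
    """Return the longest run of tokens that appears verbatim in both texts.

    Tokens are whitespace-separated words (an assumption of this sketch).
    """
    gen_tokens = generated.split()
    src_tokens = known_source.split()
    # autojunk=False so frequent tokens are not silently ignored on long inputs.
    matcher = SequenceMatcher(None, gen_tokens, src_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(gen_tokens), 0, len(src_tokens))
    return gen_tokens[match.a:match.a + match.size]


def looks_copied(generated: str, known_source: str, min_tokens: int = 8) -> bool:
    """Flag the output if it shares a run of at least min_tokens tokens.

    The threshold of 8 is arbitrary and hypothetical, not a legal standard.
    """
    return len(longest_verbatim_run(generated, known_source)) >= min_tokens
```

For example, calling `looks_copied(model_output, licensed_file_text)` with the text of one specific licensed file would flag an output that embeds a long stretch of that file unchanged. It cannot answer the general question, because you would need to check against every possible source.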

> So why do we care from where LLMs learn?

same difference there is between painting your own fake Caravaggio and buying a fake Caravaggio (or selling the one you made).

the second one is forgery, the first one is not.

The way I see it is that with AI you have really painted your own Caravaggio, but instead of an electrochemical circuit of a human brain you've employed a virtual network.

> but instead of an electrochemical circuit of a human brain you've employed a virtual network.

technically it is still a tool you are using, which is different from doing it on your own, with your hands, using your own brain cells that you trained over the decades, instead of using a virtual electronic brain pre-trained in hours/days by someone else on who knows what.

Okay, if it’s about looking at one painting and faking that. However, if you train your model on billions of paintings and create arbitrary new ones from that, it’s just a statistical analysis of what paintings in general are made of.

The importance of the individual painting diminishes at this scale.

And if you look at lots of paintings, and create a new painting which is in a very similar style to an existing painting?

Is that a forgery? Have you infringed on the copyright on all the paintings you looked at?

Why do people bring this up? People are not LLMs and the issues are not the same.

I'd add to this, the damage an LLM could do is much less than a human could do in terms of individual production. A person can paint only so many forgeries... A machine can create many, many more. The dilution of value from a person learning is far different from that of machine learning. The value extracted and diluted is night and day in terms of scale.

Not to say what will/won't happen. In practice, what I've seen doesn't scare me much in terms of what LLMs produce vs. what a person has to clean up after it's produced.

Why are the issues not the same? Are you privileging meat over silicon?

Yes they are. Most people will.

They are not the same because an LLM is a construct. It is not a living entity with agency, motive, and all the things the law was intended for.

We will see new law as this tech develops.

For an analogy, many people call infringement theft and they are wrong to do so.

They will focus on the "someone getting something without having followed the right process" part while ignoring the equally important "someone else being denied the use of, or losing, property" part.

The former is an element in common between theft and infringement. And it is compelling!

But, the real meat in theft is all about people losing property! And that is not common at all.

This AI thing is similar. The common elements are super compelling.

But it just won't be about that in the end. It will be all about the details unique to AI code.

Using the word "construct" isn't adding anything to the conversation. If we bioengineer a sentient human, would you feel OK torturing it because it's "just a construct"? If that's unethical to you, how about half meat and half silicon? How much silicon is too much silicon and makes torture OK?

> Most people will [privilege meat]

"A person is smart. People are dumb, panicky dangerous animals, and you know it". I agree that humans are likely to pass bad laws, because we are mostly just dumb, panicky dangerous animals in the end. That's different from asking an internet commenter why they're being so confident in their opinions though.

Companies don't go around donating their source code to universities either, even if it was only for the purpose of learning.

> So why do we care from where LLMs learn?

Because humans don't put the "Shutterstock" watermark logo on the images they produce.

As with all absolutes*, exceptions exist:

Viagra Boys - In Spite Of Ourselves (with Amy Taylor)

    I absolutely love that the entirety of the video is unpurchased stock footage with the watermark still on it. This is cinematic gold.
https://www.youtube.com/watch?v=WLl1qpDL7YA

* well, most ...

cargo cult programming is real though