by jameshart 1089 days ago
There’s a clear separation between the training process which looks at code and outputs nothing but weights, and the generation process which takes in weights and prompts and produces code.

The weights are an intermediate representation that contains nothing resembling the original code.

But the original content is frequently recoverable.

You can't just take copyrighted code, base64-encode it, send it to someone, have them decode it, and claim there was no copyright violation.
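To make the base64 point concrete, here is a minimal Python sketch. The snippet in `original` is a made-up stand-in for copyrighted code, not anything real. It only shows that the encoded form looks nothing like the source yet decodes back to it exactly; whether model weights behave this way is the question being argued.

```python
import base64

# Hypothetical stand-in for a copyrighted source file (invented for illustration).
original = "def secret_sauce(x):\n    return x * 42\n"

# Encode: the result is an ASCII string that does not look like the source.
encoded = base64.b64encode(original.encode("utf-8")).decode("ascii")

# Decode: the recipient recovers the source byte for byte.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded == original)
```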

From my (admittedly vague) understanding, copyright law cares about the lineage of data, and I don't see how any reasonable interpretation could conclude that the lineage doesn't pass through models.

IANAL

> But the original content is frequently recoverable.

What if we train the model on paraphrases of the copyrighted code? The model can't reproduce exactly what it has not seen.

Also consider the size ratio: 1 TB of code and text ends up as 1 GB of model weights. There is no space to "memorize" the training set; it can only learn basic principles and how to combine them to generate code on demand.
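A back-of-the-envelope version of that ratio, taking the 1 TB and 1 GB figures at face value and assuming decimal units (both assumptions, not measurements of any real model):

```python
# Figures from the comment, taken at face value; decimal units are assumed.
training_bytes = 10**12  # 1 TB of code and text
weight_bytes = 10**9     # 1 GB of model weights

# How many bytes of training data there are per byte of weights.
ratio = training_bytes // weight_bytes

# Average number of weight bits available per byte of training data.
bits_per_training_byte = weight_bytes * 8 / training_bytes

print(ratio)
print(bits_per_training_byte)
```

This is only an average budget. It says nothing about how that capacity is spread across individual files.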

Copyright law should, in principle, protect only expression, not ideas. As long as the model learns the underlying principles without copying the superficial form, it should be OK. That's my 2c.

The fact that this is a problem is a bug in copyright law, not a shortcoming of the LLM.
The neurons in my brain when I plagiarize are just arrangements of atoms that contain nothing that resembles the original code, text passages, etc.
The trained weights of a GPT model are a frozen, static, transmissible representation. They’re not equivalent to the live state of a brain.
Pretty equivalent to a snapshot of a live brain. The units inside it are even called neurons, and the whole thing a neural network.
No, they are the weights that are used to configure a neural network. They’re a map of how to build a useful brain, not a neural state.
Machine learning neural networks have almost nothing to do with how brains work besides a tenuous mathematical relation that was conceived in the 1950s.
You can say that if you want to nitpick, but there are recent studies showing that neural and brain representations align rather well, to the point that we can predict what someone is seeing from brain waves, or generate the image with stable diffusion.

https://sites.google.com/view/stablediffusion-with-brain/

I think brain to neural net alignment is justified by the fact that both are the result of the same language evolutionary process. We're not all that different from AIs, we just have better tools and environments, and evolutionary adaptation for some tasks.

Language is an evolutionary system, ideas are self replicators, they evolve parallel to humans. We depend on the accumulation of ideas, starting from scratch would be hard even for humans. A human alone with no language resources of any kind would be worse than a primitive.

The real source of intelligence is the language data from which both humans and AIs learn; model architecture is not very important. Two different people, with different neural wiring in the brain, or two different models, like GPT and T5, can learn the same task given the same training set. What matters is the training data. It should be credited with the skills we and AIs obtain. Most of us live our whole lives at this level and never come up with an original idea; we're applying language to tasks, like GPT.

> The weights are an intermediate representation that contains nothing resembling the original code.

So is the ELF.

I think this view is incredibly dangerous to any kind of skills mastery. It has the potential to completely destroy the knowledge economy and eventually degrade AI due to a dearth of training data.
It reminds me of people needing to do a "clean room implementation" without ever seeing similar code. I feel like a human being who read a bunch of code and then wrote something similar without copy/paste or looking at the training data should be protected, and therefore an AI should too.
Okay, that’s an argument from consequences, but is the view factually wrong?
I mean, those consequences are why patent law exists. New technology may require new regulatory frameworks, as it has since the railroads. The idea that we could not amend the law, and that we need to pedantically say "well, this isn't illegal now" as an excuse for doing something unethical and harmful to the economy, is in my opinion very flawed.
Is it really harmful to the economy, or only to entrenched players? Coding AI should be a benefit to many, like open source is. It opens the source even more, should be a dream come true for the community. It's also good for learning and lowering the entry barrier.

At the same time, it does not replace human developers in any application; it might take a long time until we can go on vacation and let AI solve our Jira tickets. Remember that self-driving has been under intense research for more than a decade now, and it's still far from L5.

It's a trend that holds in all fields. AI is a tool that stumbles without a human to wield it; it does not replace humans at all. But with each new capability, it invites us to launch new products and create jobs. Human empowerment without human replacement is what we want, right?