by triceratops 403 days ago
The counterargument to that is model training is impossible without making copies. That's not true for humans.

That's not really true. Models train (in a greatly simplified way) by being shown an excerpt and being told to guess the next token from the excerpt. They push around their weights until the token they output matches the next token in the excerpt. Then the excerpt is no longer needed. You can think of it like this: the article is loaded, the LLM plays this token-guessing game through it, then the article is discarded. On the face of it this is what happens, but it gets hairier depending on how exactly this process is done. But it is seemingly not far removed from how humans consume content (acquire, read, discard), hence the legal blur.
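
To make the guessing game concrete, here is a toy sketch in Python. It is not how any real LLM is implemented: it assumes a tiny table of per-token scores instead of a neural network, and the names `train_on_excerpt` and `predict_next` are made up for illustration. It only shows the shape of the loop: load an excerpt, adjust weights until the guesses match, drop the excerpt.

```python
import math

def train_on_excerpt(weights, index, tokens, lr=0.5, passes=20):
    """Nudge the weights until each token predicts the token after it."""
    for _ in range(passes):
        for prev, nxt in zip(tokens, tokens[1:]):
            row = weights[index[prev]]            # scores for "what follows prev?"
            peak = max(row)
            exps = [math.exp(s - peak) for s in row]
            total = sum(exps)
            for j, e in enumerate(exps):
                guess = e / total                 # model's probability for token j
                truth = 1.0 if j == index[nxt] else 0.0
                row[j] -= lr * (guess - truth)    # cross-entropy gradient step

def predict_next(weights, vocab, index, token):
    """Return the token the weights score highest as the successor."""
    row = weights[index[token]]
    return vocab[row.index(max(row))]

# Hypothetical five-token "article"; a real corpus would be vastly larger.
excerpt = "models guess the next token".split()
vocab = sorted(set(excerpt))
index = {tok: i for i, tok in enumerate(vocab)}
weights = [[0.0] * len(vocab) for _ in vocab]

train_on_excerpt(weights, index, excerpt)
del excerpt  # the excerpt is discarded; only the adjusted weights remain

print(predict_next(weights, vocab, index, "guess"))  # prints "the"
```

Note that the excerpt still had to sit in memory while the loop ran, and that with an input this small the weights end up encoding it exactly.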
> by being shown an excerpt [of copyrighted material]

How is this done? Are bits not written into RAM or disk? Are they not sent between machines in a training cluster? That's copying.

> it is seemingly not far removed from how humans consume content

Except that humans don't make full copies to RAM, or disk or paper.

There is a bar of usage built into the law, otherwise everyone who reads this wired article is violating copyright by making a full copy to their computer. Generally, making non-lasting copies is fine; otherwise the internet wouldn't work.

AI doesn't need lasting copies to train; however, I don't know what the actual implementation is. But if it's ruled that they can only use copyrighted data if it's not stored for more than the time it would take a human to consume it, that wouldn't really cripple the models, but it would perhaps make training more logistically challenging.

It's important to understand that models are not data archives. They are statistical constructs built by being quizzed, with human-made content used to generate the quiz questions.

> otherwise everyone who reads this wired article is violating copyright by making a full copy to their computer

Wired explicitly sent that article to their computer for the purposes of reading it so it's not a copyright violation.

> Except that humans don't make full copies to RAM, or disk or paper.

Images on your retina form exact copies.

They are scanned and translated into impulses that are then sent to a first set of "neural columns" - that's an exact copy.

This is then connected to the visual cortex by the two highest-bandwidth links in the human body ("the optic nerve"; there are two of them, of course, and I've always wondered why everybody insists on using the singular). Why would you have such a high-bandwidth link unless it was to create verbatim copies?

The way those columns are structured also very strongly suggests they make carbon copies, which they then make available on the "brain bridge" (which is probably at least vaguely similar to the "attention matrix" of a transformer). If it does work like that, that's also a verbatim copy.

The only way "humans don't make full copies to RAM" is that humans don't have separate RAM. The memory is colocated with the processing, even on a microscopic level. You know, what everybody knows is the best way of doing things even in silicon; it's just incredibly impractical if you can't rebuild your circuit every time there's a slight change to the instructions your "computer" carries out. (The brain is not a "Von Neumann architecture", except it kind of is when it regrows connections. But in the short term it isn't.)

> that's an exact copy.

Not for the purposes of copyright law.

> is that humans don't have separate RAM [or disk]

And that turns out to be incredibly important. Humans can't create a lasting, shareable copy of a copyrighted work by consuming it.

Sure they can. You can learn a copyrighted work by heart, even indirectly, then quickly duplicate it by hand. Mozart was originally famous for making a business out of that.
> then quickly duplicate it by hand

And that's a copyright violation.

It is a different thing. When you copy data into a computer's RAM, that might be copying as defined in law [1]:

> Using software almost always involves creating copies, even though many of these copies only exist for a very short time. For example, executing a program means copying it from the hard disk into RAM so that the CPU can interpret the instructions. Because of this, the right to run a program is considered to fall under the copyright of the author.

For comparison, when a human looks at the letters, there is no copying.

Also, models can reproduce text verbatim which proves that they store it.

So it is unfair when ordinary folks get sued for this and Zuckerberg wants to get away with a violation a million times larger. He must go directly to jail.

[1] https://www.iusmentis.com/copyright/software/rights/

It's also true for humans: you memorize only parts of what you read and see, but you still had to view the whole thing first.

The computer model works differently, of course, but functionally it's the same idea.

God I hate this conversation so much. These cases have nothing to do with how the brain works.
That's why the keyword here is functionally.