Hacker News
by lelanthran 157 days ago
> To call training illegal is similar to calling reading a book and remembering it illegal.

A type of wishful thinking fallacy.

In law, scale matters. It's legal for you to possess a single joint. It's not legal to possess 400 tons of weed in a warehouse.


It is not the scale that matters here, in your example, but intent. With 1 joint, you want to smoke it yourself. With 400, you very possibly want to sell it to others. Scale in itself doesn't matter; scale matters only to the extent that it changes what your intention may be.
> It is not the scale that matters here, in your example, but intent. With 1 joint, you want to smoke it yourself. With 400, you very possibly want to sell it to others. Scale in itself doesn't matter; scale matters only to the extent that it changes what your intention may be.

It sounds, then, like you're saying that scale does indeed matter in this context: every single piece of writing in existence isn't being slurped up purely to learn, it's being slurped up to make a profit.

Do you think they'd be able to offer a useful LLM if the model was trained only on what an average person could read in a lifetime?

It's common knowledge among LLM experts that the current capabilities of LLMs are triggered as emergent properties of training transformers on reams and reams of data.

That is the intent of scale: to trigger LLMs to reach this point of "emergence". Whether or not it's AGI is a debate I'm not willing to entertain, but pretty much everyone agrees that there's a point where scale flips a transformer from being an autocomplete machine to something more than that.

That is the legal basis for why companies would go for scale with LLMs. It's the same reason why people are allowed to own knives even though knives are known to be useful for murder (as a side effect).

So technically speaking, these companies have legal runway in terms of intent. Making an emergent and helpful AI assistant is not illegal, and making a profit isn't illegal either.

Right, but in the weed analogy, the scale is used as a proxy to assume intent. When someone is caught with those 400 joints, the prosecution doesn't have to prove intent, because the law has that baked in already.

You could say the same in LLM training, that doing so at scale implies the intent to commit copyright infringement, whereas reading a single book does not. (I don't believe our current law would see it this way, but it wouldn't be inconsistent if it did, or if new law would be written to make it so.)

It’s clear that Nvidia and every single one of these big AI corps do not want their AIs to violate the law. The intent is clear as day here.

Scale is only used for emergence. OpenAI found that training transformers on the entire internet would make a model more than just a next-token predictor, and that is the intent everyone is going for when building these things.

I don't think that's clear at all. Businesses routinely break the law if they believe the benefits in doing so will outweigh the consequences.

I think this is even more common and more brazen when it comes to "disruptive" businesses and technologies.

>Businesses routinely break the law if they believe the benefits in doing so will outweigh the consequences.

I'm saying there's collective incentive among businesses to restrict the LLM from producing illegal output. That is aligned and ultra clear. THAT was my point.

But if LLMs produce illegal output as a side effect and it can't be controlled, then your point comes into play, because now they have to weigh the costs and benefits, as they don't have a choice in the matter. But that wasn't what I was getting at. That's a new point, which you introduced here.

In short it is clear all corporations do not want LLMs to produce illegal content and are actively trying to restrict it.

Er no. I’ve read and remember hundreds of books in my lifetime. It’s not any more illegal based on scale. If the law doesn’t differentiate between remembering one book and a hundred, then there’s no difference for thousands or millions.

No wishful thinking here.

> Er no. I’ve read and remember hundreds of books in my lifetime. It’s not any more illegal based on scale.

I'm not sure you understood what you said, but superficially it appears that you are agreeing with me?

Just because it's legal to read 100s of books does not make it legal to slurp up every single piece of produced content ever recorded.

We're talking many, many orders of magnitude in scale there, and you're the one who pointed out that scale :-/

No I'm not agreeing with you.

>Just because it's legal to read 100s of books does not make it legal to slurp up every single piece of produced content ever recorded.

The law says you're perfectly in your legal right to slurp up every piece of content ever produced.

>We're talking many, many orders of magnitude in scale there, and you're the one who pointed out that scale :-/

I'm aware, and the law doesn't talk about scale.

What is "scale" in this context? I think arguably 100 books over the span of decades is not "scale".

But tens (hundreds?) of thousands of books over the span of a few weeks? That's definitely "scale".

The law doesn't talk about scale, so either is perfectly legal. Memorizing a billion books vs. memorizing one book: the same laws apply.