	by Workaccount2 531 days ago
	LLMs are not massive archives of data. They are a tiny fraction of a fraction of a percent of the size of their training set. And before you knee-jerk "it's a compression algo!", I invite you to archive all your data with an LLM's "compression algo".

3 comments

timewizard 531 days ago

> LLMs are not massive archives of data.

Neither am I, yet I am still capable of reproducing copyrighted works to a level that most would describe as illegal.

> And before you knee-jerk "it's a compression algo!"

It's literally a fundamental part of the technology so I can't see how you call it a "knee jerk." It's lossy compression, the same way a JPEG might be, and simply recompressing your picture to a lower resolution does not at all obviate your copyright.

> I invite you to archive all your data with an LLM's "compression algo".

As long as we agree it is _my data_ and not yours.

Isamu 531 days ago

> It's lossy compression, the same way a JPEG might be

Compression yes, but this is co-mingling as well. The entire corpus is compressed together, which identifies common patterns, and in the model they are essentially now overlapping.

The original document is represented statistically in the final model, but you’ve lost the ability to extract it closely. Instead you gain the ability to generate something statistically similar to a large number of original documents that are related or are structurally similar.

I’m just commenting, not disputing any argument about fair use.

BobbyTables2 531 days ago

Copying a single sentence verbatim from a 1000 page book is still plagiarism.

And is technically copyright infringement outside fair use exceptions.

concerndc1tizen 531 days ago

And similarly, translating those sentences into data points is still a derivative work, like transcribing music and then making a new recording is still derivative.

jpollock 531 days ago

Derivative works still tend to be copyright violations.

concerndc1tizen 531 days ago

Yes, that's what I'm saying. An LLM washing machine doesn't get rid of the copyright.

int_19h 531 days ago

It doesn't matter. It's still a derived work.

baxtr 531 days ago

Well what isn’t in this world?

Would Einstein have been possible without Newton?

int_19h 529 days ago

I'm fine with us ditching copyright altogether.

But as things are, the megacorps are training their LLMs on the commons while asserting "intellectual property" rights on the resulting weights. So, fuck them, and cheers to those who try to do something about this state of affairs.

thedailymail 531 days ago

Newton was public domain by Einstein's time.

jampekka 531 days ago

Indeed. Copyright was introduced in 1710; the Principia was published in 1687.

yieldcrv 531 days ago

And even with our current copyright laws providing for long-dated protection, it would still have been in the public domain.

jampekka 531 days ago

It's hard to say what the current laws actually imply. Steamboat Willie was originally meant to be in the public domain in 1955. Got there in 2024.
