Hacker News
by Workaccount2 531 days ago
LLMs are not massive archives of data. They are a tiny fraction of a fraction of a percent of the size of their training set.

And before you knee-jerk "it's a compression algo!", I invite you to archive all your data with an LLM's "compression algo".

3 comments

> LLMs are not massive archives of data.

Neither am I, yet I am still capable of reproducing copyrighted works to a level that most would describe as illegal.

> And before you knee-jerk "it's a compression algo!"

It's literally a fundamental part of the technology so I can't see how you call it a "knee jerk." It's lossy compression, the same way a JPEG might be, and simply recompressing your picture to a lower resolution does not at all obviate your copyright.

> I invite you to archive all your data with an LLM's "compression algo".

As long as we agree it is _my data_ and not yours.

> It's lossy compression, the same way a JPEG might be

Compression yes, but this is co-mingling as well. The entire corpus is compressed together, which identifies common patterns, and in the model they are essentially now overlapping.

The original document is represented statistically in the final model, but you’ve lost the ability to extract it closely. Instead you gain the ability to generate something statistically similar to a large number of original documents that are related or are structurally similar.
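
To make the co-mingling point concrete, here is a toy sketch. It is a word-bigram counter, not an LLM, and the two "documents" and the function names `train_bigram` and `generate` are made up for illustration. It only shows how pooling counts from several documents into one shared table makes them overlap, so sampling yields text that resembles the sources without being tied to any one of them.

```python
from collections import Counter, defaultdict
import random

def train_bigram(docs):
    """Pool word-bigram counts from every document into one shared table."""
    table = defaultdict(Counter)
    for doc in docs:
        words = doc.split()
        for prev, nxt in zip(words, words[1:]):
            table[prev][nxt] += 1
    return table

def generate(table, start, length, seed=0):
    """Sample up to `length` words by following the pooled bigram counts."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = table.get(out[-1])
        if not followers:
            break  # no known continuation for this word
        words, weights = zip(*followers.items())
        out.append(rng.choices(words, weights=weights)[0])
    return " ".join(out)

# Hypothetical two-document "corpus".
docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
model = train_bigram(docs)

# Both documents contribute to the same entry: the table no longer
# records which document a given "sat on" came from.
print(model["sat"]["on"])
print(generate(model, "the", 6))
```

The generated sentence can mix pieces of both inputs (a cat on the rug, say), which is roughly the "statistically similar, but not closely extractable" behaviour described above. In this tiny case an original sentence can still come out verbatim by chance.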

I’m just commenting, not disputing any argument about fair use.

Copying a single sentence verbatim from a 1000-page book is still plagiarism.

And is technically copyright infringement outside fair use exceptions.

And similarly, translating those sentences into data points is still a derivative work, like transcribing music and then making a new recording is still derivative.

Derivative works still tend to be copyright violations.

Yes, that's what I'm saying. An LLM washing machine doesn't get rid of the copyright.

It doesn't matter. It's still a derived work.

Well, what isn't in this world?

Would Einstein have been possible without Newton?

I'm fine with us ditching copyright altogether.

But as things are, the megacorps are training their LLMs on the commons while asserting "intellectual property" rights on the resulting weights. So, fuck them, and cheers to those who try to do something about this state of affairs.

Newton was public domain by Einstein's time.

Indeed. Copyright was introduced in 1710; the Principia was published in 1687.

And even with our current copyright laws providing for long-dated protection, it would still have been in the public domain.

It's hard to say what the current laws actually imply. Steamboat Willie was originally meant to be in the public domain in 1955. It got there in 2024.