| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by tsimionescu 907 days ago
	The model weights clearly encode certain full passages of text, otherwise it would be virtually impossible for the network to produce verbatim copies of text. The format is something very vaguely like "the most likely token after "call" is "me"; the most likely token after "call me" is "Ishmael". It's ultimately a kind of lossy statistical compression scheme at some level.

1 comments

photonthug 907 days ago

> It's ultimately a kind of lossy statistical compression scheme at some level.

And on this subject, it seems worthwhile to note that compression has never freed anyone from copyright/piracy considerations before. If I record a movie with a cell phone at a worse quality, that doesn't change things. If a book is copied and stored in some gzipped format where I can only read a page at a time, or only read a random page at a time, I don't think that's suddenly fair-use.

Not saying these things are exactly the same as what LLMs do, but it's worth some thought, because how are we going to make consistent rules that apply in one case but not the other?

link

fennecbutt 906 days ago

Is it still compression if I read Tolkien and reference similar or exact concepts when writing my own works?

Having a magical ring in my book after I've read lord of the rings, is that copyright?

link

tsimionescu 905 days ago

Generally, no, copyright deals with exact expression, not concepts. However, that can include the structure of a work, so if you wrote a book about little people who form a band together with humans and fairies and a mage to destroy a ring of power created by an ancient evil, where the start in their nice home but it gets attacked by the evil lord's knights [...] you may be breaking Tolkien's copyright.

link

seanmcdirmid 907 days ago

If you watch a bunch of movies then go on to make your own movie based on influence from these movies, you are protected even if you have mentally compressed them into your own movie. At some point, you can learn, be influenced and be inspired from copyrighted material (not copyright infringement), and at some point you are just making a poor copy of the material (definitely copyright infringement). LLMs are probably still at the latter case than the former, but eventually AI will reach the former case.

link

photonthug 907 days ago

There's no obvious need to hold people / AI to same standards here, yet, even if compression in mental-models is exactly analogous to compression in machine-models. I guess we decided already that corporations are already "like" persons legally, but the jury is still out on AIs. Perhaps people should be allowed more leeway to make possibly-questionable derivative works, because they have lives to live, and genuine if misguided creative urges, and bills to pay, etc. Obviously it's quite difficult to try and answer the exact point at which synthesis & summary cross a line to become "original content". But it seems to me that, if anything, machines should be held to higher standard than people.

Even if LLMs can't cite their influences with current technology, that can't be a free pass to continue things this way. Of course all data brokers resist efforts along the lines of data-lineage for themselves and they want to require it from others. Besides copyright, it's common for datasets to have all kinds of other legal encumbrances like "after paying for this dataset, you can do anything you want with it, excepting JOINs with this other dataset". Lineage is expensive and difficult but not impossible. Statements like "we're not doing data-lineage and wish we didn't have to" are always more about business operations and desired profit margins than technical feasibility.

link

seanmcdirmid 907 days ago

> But it seems to me that, if anything, machines should be held to higher standard than people.

If machines achieve sentience, does this still hold? Like, we have to license material for our sentient AI to learn from? They can't just watch a movie or read a book like a normal human could without having the ability to more easily have that material influence new derived works (unlike say Eragon, which is shamelessly Star Wars/Harry Potter/LOTR with dragons).

It will be fun to trip through these questions over the next 20 years.

link

Jensson 907 days ago

As long as machines needs to leech on human creativity those humans needs to be paid somehow. The human ecosystem works fine thanks to the limitations of humans. A machine that could copy things with no abandon however could easily disrupt this ecosystem resulting in less new things being created in total, it just leeches without paying anything back unlike humans.

If we make a machine that is capable of being as creative as humans and train it to coexist in that ecosystem then it would be fine. But that is a very unlikely case, it is much easier to make a dumb bot that plagiarizes content than to make something as creative as a human.

link

seanmcdirmid 906 days ago

> If we make a machine that is capable of being as creative as humans and train it to coexist in that ecosystem then it would be fine. But that is a very unlikely case, it is much easier to make a dumb bot that plagiarizes content than to make something as creative as a human.

I disagree that our own creativity doesn't work that way: nothing is very original, our current art is based on 100k years of building up from when cave man would scrawl simple art into the stone (which they copied from nature). We are built for plagiarism, and only gross plagiarism is seen as immoral. Or perhaps, we generalize over several different sources, diluting plagiarism with abstraction?

We are still in the early days of this tech, we will be having very different conversations about it even as soon as 5 years later.

link

tremon 907 days ago

But that's not what ChatGPT is doing, or is it? ChatGPT watches and records a bunch of movies, then stitches together its own movie using scenes and frames from the movies it recorded. AI will never reach the former case until it learns to operate a camera.

link

seanmcdirmid 907 days ago

How do you not know this isn’t what we are doing in some more advanced form? Anyways, the comparisons will become more apt as the tech advances.

link