by canoebuilder 900 days ago
In considering these things, allowances must be made for the new abilities made possible by the computational tools now at our disposal.

Following your line of reasoning: if it is perfectly legal to walk into a coffee shop, sit down, listen to what the people next to me are talking about, commit it to memory, and even make notes about it, does it then follow that it should be perfectly legal, reasonable, and acceptable for a govt agency or some other organization to put microphones everywhere to record what everyone is talking about, then feed all this data into various databases and modeling systems?

Reciting something in a park is different from selling a copyrighted print of something in a park when you don’t hold the copyright, and the latter is much closer to what the NYT is accusing OpenAI of.

The training data not “existing” in the model is interesting, but at some point it becomes a distinction without a difference.

If I hire an autistic savant to go to a library and read all the books, then I set up a book selling service where whenever people want to buy a book I have my savant employee type out the book for them, is it then going to pass muster in a copyright case if I tell the judge “It’s okay actually, because the books don’t actually exist in my employee’s brain, merely neuronal encodings of them”?

If I have a copyrighted image on which I don’t hold the copyright, but I want to start selling it to people, is it cool if I just run it through a lossless compression algorithm, thereby generating a new encoding of the information, and then sell this new encoding along with the software and command to reverse the compression?
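To make the lossless case concrete, here is a minimal sketch using Python's standard `zlib` module. The byte string is a made-up stand-in for an image file, not real image data; the point is only that the compressed bytes look nothing like the original while the round trip restores every bit.

```python
import zlib

# Hypothetical stand-in for the raw bytes of an image file (an assumption
# for illustration; any byte string behaves the same way).
original = bytes(range(256)) * 64

# Lossless compression yields a different byte sequence...
compressed = zlib.compress(original, 9)

# ...but decompression recovers the original exactly.
restored = zlib.decompress(compressed)

print(compressed == original)  # False
print(restored == original)    # True
```

So the "new encoding" differs from the original bit for bit, yet anyone holding it plus the decompressor holds the original.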

Regarding the open source stuff, there I think you might find more favor for your arguments.

But the stuff we are seeing within commercial enterprises like OpenAI and Midjourney is clearly copyright infringement.

And I don’t see copyright law being insane in these cases.

2 comments

It would be perfectly legal if a million government agents went into coffee shops and recorded what they heard. It is the leaving of government property on private property that is the real issue, as well as transparency... not access to information (please don't do this, my government).

As far as the savant reading all the books analogy goes... it's a bit off base, mostly because the AI isn't attempting to do that. It would have to be specially prompted to generate that information (which, as far as I understand, is what's happening: people giving verbose special prompts to 'extract' copyrighted text... and 'regenerate' is better verbiage than 'extract', considering there's no guarantee the generation will be a perfect reproduction). To fix the analogy: the savant reads all the books in the library, then someone asks him to write a brand new book, which contains some passages that happen to be like those in copyrighted works... this is 100% comparable to what human writers do all the time. Why would we ever want to punish an AI for reading and remembering better than us?

On top of that imperfect reproduction, the analogy also assumes the copy is sold as if it were the original... that's a lot of additional assumptions to make...

Sadly, the lossless compression is also a bad analogy. The math maps one-to-one and is 100% translatable back, so the encoding doesn't really change the bits... if you compress it lossily, to the point of doing it artistically, then... if none of the bits are the same, it's not the same picture, and it doesn't hold any 'bit' of the old image.

Good reply!

> If I hire an autistic savant to go to a library and read all the books, then I set up a book selling service where whenever people want to buy a book I have my savant employee type out the book for them, is it then going to pass muster in a copyright case if I tell the judge “It’s okay actually, because the books don’t actually exist in my employee’s brain, merely neuronal encodings of them”?

No, and I do think OpenAI returning copyrighted works verbatim is probably copyright infringement even if it’s “laundered” through an LLM.

However, if the autistic savant only provided summaries, analyses, etc., that is fair use (IANAL), and it should be for LLMs too.

That probably means LLMs will need some sort of scrubbing process to ensure exact training data can’t be reproduced, or, if that’s not feasible, some type of output filter that looks for training data (although that would be a problem for open source models).
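For illustration, an output filter of that kind could be sketched as a word n-gram overlap check against an index of protected text. Everything here is an assumption of mine rather than how any vendor actually does it: the function names, the n-gram size, the threshold, and the sample text are all hypothetical.

```python
def ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(protected_texts, n):
    """Collect every n-gram that appears in the protected texts."""
    index = set()
    for text in protected_texts:
        index |= ngrams(text, n)
    return index


def overlap_ratio(output, index, n):
    """Fraction of the output's n-grams that also appear in the index."""
    grams = ngrams(output, n)
    if not grams:
        return 0.0  # output shorter than n words: nothing to compare
    return len(grams & index) / len(grams)


def is_allowed(output, index, n, threshold=0.5):
    """Allow the output only if its overlap stays below the threshold."""
    return overlap_ratio(output, index, n) < threshold


# Hypothetical protected passage and two candidate model outputs.
protected = ["the quick brown fox jumps over the lazy dog near the quiet river bank today"]
index = build_index(protected, n=5)

verbatim = protected[0]
summary = "a fox jumps over a dog by a river in this short story"

print(is_allowed(verbatim, index, n=5))  # False
print(is_allowed(summary, index, n=5))   # True
```

A check like this only catches near-verbatim copies; paraphrases pass, which matches the summary/analysis distinction above. It also has to run outside the model, which is why it is hard to enforce for open source weights.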