Hacker News new | ask | show | jobs
by rcxdude 1 day ago
> Like, you can almost certainly get these LLMs to output gits full source code with some prompting, so there's not that much difference (as much as we like to pretend that AI generated code has no copyright implications)

Are you sure? LLMs are in some way a compressed version of their input but it's a pretty lossy compression (arguably this makes them more like a compression algorithm than a compressed version of the data). I'm not sure you can prompt a full, accurate, copy of a nontrivial codebase out of them. Even with zero temperature their accuracy is just not that high.

1 comments

> I'm not sure you can prompt a full, accurate, copy of a nontrivial codebase out of them. Even with zero temperature their accuracy is just not that high.

Granted, these are some of the most widely spread texts, and not codebases, but just fyi: https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

That paper is basically using the LLM as a compression algorithm: it's prompting with some section of the book and it's reprompting if it doesn't give the right output. Notably this only works if you already have a copy of the book in question!
Distributed a compressed copy of something is still copyright infringement
You misunderstand my point: the LLM is not a losslessly compressed version of the text: you need to supply additional information from the original in order to 'extract' it from the LLM (and from that point of view, the extra information would be the compressed form).