HN Mirror

by philipportner 49 days ago

> I'm not sure you can prompt a full, accurate, copy of a nontrivial codebase out of them. Even with zero temperature their accuracy is just not that high.

Granted, these are some of the most widely distributed texts rather than codebases, but just FYI: https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

rcxdude 48 days ago

That paper is basically using the LLM as a compression algorithm: it's prompting with some section of the book and it's reprompting if it doesn't give the right output. Notably this only works if you already have a copy of the book in question!

20k 48 days ago

Distributing a compressed copy of something is still copyright infringement.

rcxdude 47 days ago

You misunderstand my point. The LLM is not a losslessly compressed version of the text: you need to supply additional information from the original in order to 'extract' it from the LLM (and from that point of view, the extra information would be the compressed form).
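
To make the distinction concrete, here is a minimal toy sketch. It is not the paper's procedure and not a real LLM: a character bigram table stands in for the model, and `train_bigram`, `encode`, and `decode` are hypothetical names made up for this illustration. The model alone cannot reproduce the text; the corrections derived from the original are the extra information needed to recover it.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Toy stand-in for an LLM (assumption): map each character
    to the successor that most often follows it in the corpus."""
    counts = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    return {a: c.most_common(1)[0][0] for a, c in counts.items()}

def predict(model, prev):
    # Fall back to a space when the model has never seen `prev`.
    return model.get(prev, " ")

def encode(model, text):
    """Needs the ORIGINAL text: record every position where the
    model's guess is wrong, together with the correct character."""
    corrections = {}
    prev = ""
    for i, ch in enumerate(text):
        if predict(model, prev) != ch:
            corrections[i] = ch
        prev = ch
    return corrections

def decode(model, corrections, length):
    """Rebuild the text from the model plus the corrections."""
    out = []
    prev = ""
    for i in range(length):
        ch = corrections.get(i, predict(model, prev))
        out.append(ch)
        prev = ch
    return "".join(out)

text = "the cat sat on the mat"
model = train_bigram(text)
corrections = encode(model, text)

# Model + corrections recovers the text exactly.
restored = decode(model, corrections, len(text))
# Model alone (no corrections) does not.
model_only = decode(model, {}, len(text))
```

The better the model predicts the text, the fewer corrections are needed, but producing them still requires a copy of the original.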
