| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by red75prime 19 days ago

> This is not a hypothetical scenario; I’ve personally encountered a case of someone using an LLM attempt to contribute code I recognized from a specific Open Source project under one license to another project under a different license

You say you "recognized code". Does it mean that you weren't able to find the exact match?

> an LLM is actually just regurgitating portions of its inputs

You seem to be talking about the inputs to the autoregressive pretraining stage. Correct? Then it's not how LLMs work, unless we use a definition of portions as a "few letters blocks."

1 comments

eschaton 19 days ago

I found exact matches. I also found inexact matches, where C functions had been turned into C++ member functions and the like. “Recognized” does not somehow imply a lack of precision.

The LLM the person used was trained on a very large corpus of Open Source code, and reproduced that code exactly. Just like LLMs have reproduced chapters of books and articles from the New York Times exactly.

link

red75prime 19 days ago

> I found exact matches.

Were those functions trivial? With, say, 1% probability of someone who have not seen them writing them like that?

> Just like LLMs have reproduced chapters of books and articles from the New York Times exactly.

Have you read the articles? As far as I remember they fed large chunks of an article multiple times to an LLM to sometimes get a not-so-long exact match. It can mean that LLMs can infer a style and humans are predictable.

link

Topfi 18 days ago

> […] fed large chunks of an article multiple times to an LLM […]

So they had to prompt? An LLM? I got this argument before and still don’t get what it’s trying to say. These models do not output anything unless prompted, that’s not any kind of gotcha.

On the code outputting front there is a lot of relevant evidence beyond the NYC lawsuit [0].

If I slightly modify GPL code, that doesn’t give me the right to relicense.

[0] https://arxiv.org/html/2601.02671?amp=&amp= and https://arxiv.org/abs/2506.12286 and https://ai.stanford.edu/blog/verbatim-memorization/

link

eschaton 19 days ago

No, the functions weren’t trivial, and a lot of the surrounding code and structure bore substantial similarities as well. If you saw the two files next to each other, you’d assume it was the result of a copy-paste-adjust process if you didn’t know an LLM was involved.

link

red75prime 19 days ago

I can only speculate that the model that generated the code hasn't undergone selective unlearning for verbatim data (SUV) or something similar. As you understand "sometimes generates verbatim code" and "just regurgitates [non-trivial] portions its input" are different statements.

The possibility of SUV clearly shows that a model does more than "just regurgitating."

link