| HN Mirror



	by ipsum2 1813 days ago
	Yup, this isn't a theoretical concern but a major practical one. GPT models are known for memorizing their training data: https://towardsdatascience.com/openai-gpt-leaking-your-data-... Edit: GitHub mentions the issue here: https://docs.github.com/en/github/copilot/research-recitatio... and here: https://copilot.github.com/#faq-does-github-copilot-recite-c... though they neatly ignore the issue of licensing :)

3 comments

bogwog 1813 days ago

That second link says the following:

> We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set

That's kind of a useless stat when you consider that the code it generates makes use of your existing variable/class/function names when adapting the code it finds.

I'm not a lawyer, but I'm pretty sure I can't just bypass GPL by renaming some variables.


kitsune_ 1813 days ago

It's not just about regurgitating training data during a beam search, it's also about being a derivative work, which it clearly is in my opinion.


visarga 1813 days ago

> GPT models are known for memorizing their training data

Hash each function, store the hashes as a blacklist. Then you can ask the model to regenerate the function until it is copyright safe.
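
A minimal sketch of that idea, assuming exact-match hashing of whole functions. The names `fingerprint`, `build_blocklist`, and `generate_until_novel` are made up for illustration, and `generate` stands in for whatever call produces a model suggestion. Note that renaming a single variable changes the hash, so this only catches verbatim copies.

```python
import hashlib
from typing import Callable, Iterable, Optional, Set


def fingerprint(source: str) -> str:
    """SHA-256 of the function text, ignoring leading/trailing whitespace."""
    return hashlib.sha256(source.strip().encode("utf-8")).hexdigest()


def build_blocklist(training_functions: Iterable[str]) -> Set[str]:
    """Hash every known training function once, up front."""
    return {fingerprint(f) for f in training_functions}


def generate_until_novel(
    generate: Callable[[], str],  # hypothetical: returns one model suggestion
    blocklist: Set[str],
    max_tries: int = 5,
) -> Optional[str]:
    """Resample until a suggestion is not an exact copy, or give up."""
    for _ in range(max_tries):
        candidate = generate()
        if fingerprint(candidate) not in blocklist:
            return candidate
    return None  # every attempt matched a blocked function
```

Returning `None` after `max_tries` is a design choice for the sketch; a real system would need some fallback for prompts where the model keeps reciting.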


ipsum2 1813 days ago

What if it copies only a few lines, but not an entire function? Or the function name is different, but the code inside is the same?


proteal 1813 days ago

If we could answer those questions definitively, we could also put lawyers out of a job. There’s always going to be a legal gray area around situations like this.


ipsum2 1813 days ago

Matching on the abstract syntax tree might be sufficient, but might be complex to implement.
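
For Python code at least, a rough version is not too bad with the standard `ast` module. This is only a sketch under my own assumptions: identifiers are renamed to positional placeholders (`v0`, `v1`, ...) in order of first appearance, builtins are left alone, and two snippets "match" when their canonical dumps are equal. `_Canonicalize`, `canonical_dump`, and `same_structure` are illustrative names. It does not handle partial copies of a few lines, attribute names, or async functions.

```python
import ast
import builtins


class _Canonicalize(ast.NodeTransformer):
    """Rename functions, arguments, and variables to v0, v1, ..."""

    def __init__(self):
        self.names = {}

    def _alias(self, name):
        if hasattr(builtins, name):  # keep len, sum, print, ... distinguishable
            return name
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node


def canonical_dump(source: str) -> str:
    """Parse, rename identifiers, and dump the tree without line/column info."""
    tree = _Canonicalize().visit(ast.parse(source))
    return ast.dump(tree)


def same_structure(a: str, b: str) -> bool:
    return canonical_dump(a) == canonical_dump(b)
```

Because `ast.dump` omits positions by default and the parser drops comments, whitespace and comment changes do not affect the comparison.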


vlasev 1813 days ago

You can probably tokenize the names so they become irrelevant, and ignore non-functional whitespace, so that only the normalized code C remains. Maybe one can hash all the training data D the same way and check whether hash(C) is in hash(D). Some sort of Bloom filter...
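
Something like the following, as a sketch for Python source only. The assumptions are mine: every non-keyword identifier collapses to the single placeholder `ID`, comments and blank lines are dropped, indentation is kept as token types because it is functional in Python, and the Bloom filter derives its bit positions from salted SHA-256. `normalize` and `BloomFilter` are illustrative names, and the sizes are arbitrary.

```python
import hashlib
import io
import keyword
import tokenize

_LAYOUT = (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE, tokenize.ENDMARKER)


def normalize(source: str) -> str:
    """Token stream with identifiers collapsed and comments/blank lines removed."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL):
            continue  # non-functional
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")  # names become irrelevant
        elif tok.type in _LAYOUT:
            out.append(tokenize.tok_name[tok.type])
        else:
            out.append(tok.string)
    return " ".join(out)


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

Usage would be to call `add(normalize(f))` for every function in D, then test `normalize(C) in bloom`. Collapsing all names to one placeholder makes `x + x` and `x + y` look identical, so this trades extra false positives for simplicity.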
