Hacker News
by ipsum2 1813 days ago
Yup, this isn't a theoretical concern, but a major practical one. GPT models are known for memorizing their training data: https://towardsdatascience.com/openai-gpt-leaking-your-data-...

Edit: GitHub mentions the issue here: https://docs.github.com/en/github/copilot/research-recitatio... and here: https://copilot.github.com/#faq-does-github-copilot-recite-c... though they neatly ignore the issue of licensing :)

3 comments

That second link says the following:

> We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set

That's kind of a useless stat, because the code it generates adapts whatever it finds to your existing variable/class/function names. A copy with renamed identifiers is no longer verbatim, so it presumably wouldn't count toward that 0.1%.

I'm not a lawyer, but I'm pretty sure I can't just bypass GPL by renaming some variables.

It's not just about regurgitating training data during a beam search; it's also about whether the output is a derivative work, which in my opinion it clearly is.

> GPT models are known for memorizing their training data

Hash each function in the training data and store the hashes as a blacklist. Then you can ask the model to regenerate a function until its hash is no longer on the list, i.e. until it is copyright-safe.
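
For concreteness, here is a minimal Python sketch of the hash-and-regenerate idea. Nothing here is how Copilot actually works: `function_hash`, `build_blocklist`, `generate_until_safe` and the `generate` callback are all hypothetical names made up for illustration, and the hash is over the exact source text.

```python
import hashlib
from typing import Callable, Iterable, Optional, Set


def function_hash(source: str) -> str:
    """SHA-256 hex digest of the exact function text (no normalization)."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def build_blocklist(training_functions: Iterable[str]) -> Set[str]:
    """Hash every training-set function once, up front."""
    return {function_hash(src) for src in training_functions}


def generate_until_safe(
    generate: Callable[[], str],  # hypothetical stand-in for "ask the model again"
    blocklist: Set[str],
    max_attempts: int = 5,
) -> Optional[str]:
    """Return the first candidate whose hash is not blocklisted, else None."""
    for _ in range(max_attempts):
        candidate = generate()
        if function_hash(candidate) not in blocklist:
            return candidate
    return None
```

Note that an exact-text hash changes as soon as a single identifier or space changes, so on its own this only catches byte-for-byte copies.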

What if it copies only a few lines, but not an entire function? Or the function name is different, but the code inside is the same?
If we could answer those questions definitively, we could also put lawyers out of a job. There’s always going to be a legal gray area around situations like this.
Matching on the abstract syntax tree might be sufficient, but might be complex to implement.
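
A rough sketch of what AST matching could look like, using Python's standard `ast` module and assuming the code being compared is Python. The `_Anonymize` class and `ast_fingerprint` function are hypothetical names for illustration. Identifiers are replaced by placeholders in order of first appearance, so two functions that differ only in naming, comments or spacing produce the same fingerprint.

```python
import ast


class _Anonymize(ast.NodeTransformer):
    """Rename function names, arguments and variables to id0, id1, ..."""

    def __init__(self):
        self.names = {}

    def _alias(self, name: str) -> str:
        # Same original name always maps to the same placeholder.
        return self.names.setdefault(name, f"id{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node


def ast_fingerprint(source: str) -> str:
    """Dump the anonymized tree; line/column attributes are excluded by default."""
    tree = _Anonymize().visit(ast.parse(source))
    return ast.dump(tree)
```

This still treats `x + 1` and `1 + x` as different code, and it does nothing about partial copies, so it narrows the gray area rather than removing it.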
You can probably tokenize the names so they become irrelevant, and ignore non-functional whitespace, so that only the normalized code C remains. Maybe one can hash all the training data D the same way and then check whether hash(C) is in hash(D). Some sort of Bloom filter...
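
Something like the following, as a sketch only. It assumes the code is Python and uses the standard `tokenize` module; `normalize` and `BloomFilter` are hypothetical names, and the filter size and hash count are arbitrary. Names become positional placeholders, comments and blank lines are dropped, and indentation is kept only as INDENT/DEDENT markers since it is functional in Python.

```python
import hashlib
import io
import keyword
import tokenize

_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
_MARKERS = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}


def normalize(source: str) -> str:
    """Return the code C with identifiers and layout details erased."""
    names = {}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append(names.setdefault(tok.string, f"ID{len(names)}"))
        elif tok.type in _MARKERS:
            out.append(tokenize.tok_name[tok.type])  # keep structure, not width
        else:
            out.append(tok.string)
    return " ".join(out)


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive several bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

You would call `seen.add(normalize(fn))` for every function in D, then test `normalize(candidate) in seen`. A hit means "probably in the training data" and a miss means "definitely not, under this normalization", which is the trade-off a Bloom filter buys in exchange for not storing D itself.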