Hacker News new | ask | show | jobs
by visarga 1821 days ago
> GPT models are known for memorizing their training data

Hash each function, store the hashes as a blacklist. Then you can ask the model to regenerate the function until it is copyright safe.

1 comments

What if it copies only a few lines, but not an entire function? Or the function name is different, but the code inside is the same?
If we could answer those questions definitively, we could also put lawyers out of a job. There’s always going to be a legal gray area around situations like this.
Matching on the abstract syntax tree might be sufficient, but might be complex to implement.
You can probably tokenize the names so they become irrelevant. You can ignore non-functional whitespace, so that code C remains. Maybe one can hash all the training data D such that hash(C) is in hash(D). Some sort of Bloom filter...