by dathinab 1812 days ago
> I cannot see why this would be a problem, or why an organic would become something superior that what they do is a creation and a ML mode is scraping your code. What is the difference here????

TL;DR: The AI doesn't know that it shouldn't just copy-paste (from perfect memory), and as such it has learned to sometimes just copy-paste things.

The GPT model doesn't "learn to understand the code and reproduce code based on that knowledge".

What it learns is a bit of understanding, but it is closer to recombining and tweaking verbatim text snippets it has seen before, without really understanding them or the concept of "not just copy/pasting code" (while still knowing which patterns "fit together").

This means that the model will, "if it fits", potentially copy/paste code "from memory" instead of writing new code which just happens to be the same or similar. It's like a person with perfect memory sometimes copy-pasting code they have seen before while pretending they wrote it based on their "knowledge". Except worse, as it will also copy semantically irrelevant comments or sensitive information (if these were not filtered out before training).

I.e. there is a difference between "having a different kind of understanding" and "vastly lacking understanding but compensating for it by copying remembered code snippets from memory".

Theoretically it could be possible to create a GPT model which is forced to only understand programming (somewhat) but not memorize text snippets, but practically I think we are still far away from this, as it's really hard to tell whether a model has memorized copyright-protected code.
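To make the "hard to say" part concrete, here is a rough sketch of the most naive check one could run: what fraction of a generated snippet's token n-grams appear verbatim in some known corpus. Everything in it is an assumption for illustration (the names token_ngrams and verbatim_overlap, whitespace tokenization, the default window of 6 tokens); it is not how any real model or vendor checks for this.

```python
def token_ngrams(text, n):
    """Return the set of whitespace-token n-grams in text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(generated, corpus_docs, n=6):
    """Fraction of the generated text's n-grams found verbatim in the corpus.

    Hypothetical helper: returns 0.0 when the generated text is shorter
    than n tokens, since there is nothing to compare.
    """
    generated_ngrams = token_ngrams(generated, n)
    if not generated_ngrams:
        return 0.0
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= token_ngrams(doc, n)
    return len(generated_ngrams & corpus_ngrams) / len(generated_ngrams)
```

A check like this only catches exact copies. Rename a variable or reformat the code and the overlap drops, even though the snippet is still "from memory", and you also need the original training corpus to compare against in the first place.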