by munksbeer
7 hours ago
When discussing this, may I ask (I know you are probably bored of the actual arguments) what "trained models on data that wasn't theirs" actually means in practice? Again, I know these arguments have been done to death, but every human who reads source code they didn't write, or views art they didn't create, and practices against that art, is training their brain on data "that wasn't theirs". They are frequently making a living doing so. Is the distinction one of scale, or is there actually a different, stricter definition that we should be using as a common language to talk about this? As in, should I not even be reading certain source code if it is not licensed appropriately, or else I will be in breach because I'm training myself illegally? And the same question for art, etc.?
|
It hinges somewhat on how much you believe things are being learned and how much is just pattern matching and borrowing a solution from memory. Certainly in the early days of Copilot it was possible to get it to output chunks of open source code near verbatim.
I think people are generally closer now to believing that these models carry out some kind of reasoning than they were in those early days, but it would also be easy to strip all of the immediately identifiable comments etc. from the training materials to make such copying harder to detect.