by fire 1216 days ago
I don't understand why they aren't tagging training data with license information and letting users pick models that exclude certain licenses. It seems like the middle ground given the stance they've taken: "we don't think it's a problem, but if this makes you feel better you can use these other models that specifically aren't trained on GPL code, or whatever."
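
To make the tagging idea concrete, here is a rough sketch of filtering a license-tagged corpus before training. The `Sample` record, the SPDX-style tags, and the repo names are all made up for illustration; nothing here reflects how any real assistant prepares its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One training snippet tagged with its license (hypothetical schema)."""
    repo: str
    license: str  # SPDX-style identifier, e.g. "MIT" or "GPL-3.0-only"
    code: str


def filter_by_license(samples, excluded):
    """Keep only samples whose license tag is not in the excluded set."""
    excluded = {lic.lower() for lic in excluded}
    return [s for s in samples if s.license.lower() not in excluded]


# Illustrative corpus; the repos and snippets are invented.
corpus = [
    Sample("example/a", "MIT", "def f(): ..."),
    Sample("example/b", "GPL-3.0-only", "def g(): ..."),
    Sample("example/c", "Apache-2.0", "def h(): ..."),
]

# A "no GPL" model flavor would train only on what survives the filter.
no_gpl = filter_by_license(corpus, {"GPL-3.0-only"})
print([s.repo for s in no_gpl])
```

The hard part in practice is getting trustworthy tags in the first place, which this sketch just assumes.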

I would prefer to see full license attributions included in generated responses, though. Something that then also wouldn't be that difficult to generate a licenses file from?
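
A sketch of what I mean, assuming the tool hands back `(license, author, source)` tuples alongside each suggestion. That tuple shape is my assumption, not something any existing tool is known to emit.

```python
from collections import defaultdict


def build_licenses_file(attributions):
    """Group (license_id, author, source) tuples into a plain-text notice.

    Duplicate attributions are collapsed, and output is sorted so the
    file is stable across runs.
    """
    by_license = defaultdict(set)
    for license_id, author, source in attributions:
        by_license[license_id].add((author, source))

    lines = []
    for license_id in sorted(by_license):
        lines.append(f"== {license_id} ==")
        for author, source in sorted(by_license[license_id]):
            lines.append(f"{author} ({source})")
        lines.append("")  # blank line after each license section
    return "\n".join(lines)


# Hypothetical attributions collected from accepted suggestions.
collected = [
    ("MIT", "Alice", "example/a"),
    ("MIT", "Alice", "example/a"),  # same source suggested twice
    ("Apache-2.0", "Bob", "example/c"),
]
notice = build_licenses_file(collected)
print(notice)
```

A real notice file would also need the full license texts and copyright lines; this only covers the bookkeeping.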

Amazon's CodeWhisperer has a "reference tracker" that tells you the license of training data code if the generated response is within some similarity threshold, but that's still not good enough imo.
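
For anyone wondering what a similarity threshold like that could look like, here is a toy version using Jaccard overlap of token 3-grams. I have no idea what CodeWhisperer actually does internally; the metric, the 0.6 threshold, and the index layout are all assumptions.

```python
def shingles(code, n=3):
    """Return the set of whitespace-token n-grams in a snippet."""
    tokens = code.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_references(generated, index, threshold=0.6):
    """Return (source, license, score) for index entries at or above threshold."""
    gen = shingles(generated)
    hits = []
    for entry in index:
        score = jaccard(gen, shingles(entry["code"]))
        if score >= threshold:
            hits.append((entry["source"], entry["license"], score))
    # Best matches first.
    return sorted(hits, key=lambda h: h[2], reverse=True)


# Invented training-data index.
index = [
    {"source": "example/a", "license": "MIT",
     "code": "for i in range ( n ) : total += i"},
    {"source": "example/b", "license": "GPL-3.0-only",
     "code": "while True : pass"},
]

hits = find_references("for i in range ( n ) : total += i", index)
print(hits)
```

Anything that falls just under the threshold is silently unattributed, which is exactly my complaint.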

2 comments

> I would prefer to see full license attributions included in generated responses, though. Something that then also wouldn't be that difficult to generate a licenses file from?

Exactly. By all means build tools like this, but build them to actually comply with Open Source licenses. Provide a list of the licenses you don't mind copying from, and get back attributions with your suggestions.

Suppose Copilot offered a pure MIT-licensed flavor.

Copilot could comply with MIT licenses by just outputting an MIT license with ALL the authors of code used in training.

That'd be a valid solution, if impractical. I doubt that people would be willing to copy hundreds of thousands of license notices into their project.

One perspective is that those authors actually contributed to the end result.

But sure, disk size could be a problem.

> One perspective is that those authors actually contributed to the end result.

They absolutely did, yes. The approach you're suggesting would work from a legal perspective, but the size might pose practical problems.

> Amazon's CodeWhisperer has a "reference tracker" that tells you the license of training data code if the generated response is within some similarity threshold, but that's still not good enough imo.

I don't think it's possible to do better than that with this technology.

I probably don't understand this correctly, but I could have sworn we had the ability to probe the latent space of models like these and build mappings from it? Or was that only for diffusers?

Sure, you could build an index of attribution <> latent space coords, but it would not be clear whether a generated document that lands near several index entries would require compliance with all of their licenses.

I guess this is where the threshold comes from. Choose a generous margin and over-attribute rather than under-attribute.
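
Something like this, as a toy sketch: an index of (embedding, attribution) pairs and a radius query, where a bigger radius means more attributions. The 2-D vectors and the names are invented, and a real system would need actual model embeddings and an approximate nearest-neighbor index rather than a linear scan.

```python
import math


def attributions_within(query, index, radius):
    """Return attributions of all index entries within `radius` of `query`,
    nearest first, using Euclidean distance."""
    hits = []
    for vector, attribution in index:
        dist = math.dist(query, vector)
        if dist <= radius:
            hits.append((dist, attribution))
    return [attribution for _, attribution in sorted(hits)]


# Made-up 2-D "latent coordinates" for three training sources.
index = [
    ((0.0, 0.0), "Alice, MIT"),
    ((1.0, 0.0), "Bob, Apache-2.0"),
    ((5.0, 5.0), "Carol, GPL-3.0-only"),
]

query = (0.4, 0.0)  # where the generated output lands (hypothetical)
tight = attributions_within(query, index, radius=0.5)
generous = attributions_within(query, index, radius=1.0)
print(tight)
print(generous)
```

With the tight radius only the nearest source is credited; the generous one also picks up the second. That is the over-attribute side of the trade-off, and it still doesn't answer whether being "near" something legally requires compliance.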