by rymate1234 1260 days ago
> It's been clearly displayed that these tools are emitting verbatim copies of existing code (and its comments) in their input.

Which makes sense when you consider that the code getting reproduced verbatim is usually library functions that developers copy and paste, comments and all, into their projects. That's especially true when you prompt the AI with the header of a function that has been copied and pasted often: in that case the weightings will be heavily skewed towards reproducing that function.


So that should make it easy to attribute, yes?
I think it's harder, since that code is spammed around in all directions. It's easier to attribute a unique piece of code that appears in a single repo.

But boilerplate functions don't deserve copyright protection as they are not creative. Can I copyright print('hello world!') if I post it in my repo? Do I deserve a citation from now on?

For better or worse, AI is a combination of machine learning algorithms. And these algorithms are black boxes solely because we don't add observability to them - we aren't looking.

But there is a desire to understand why an AI provided the output it did (to increase trust in AI-generated output), and so there's a lot of study and work going into adding that observability. Once that's in place, it becomes pretty straightforward to identify which inputs to a model produced which outputs.

I have never seen an ML researcher claim that understanding the effect of specific training inputs on outputs is straightforward given the size of these LLMs. Most view it as a very difficult if not impossible problem.
And yet it's a major part of the overall concept of being responsible with our use of AIs. Throwing our hands up in the air and prematurely declaring defeat is not an option long term.

It's a non-starter, if for no other reason than that potential copyright infringement means the government becomes involved, and they will stomp on the AI mouse with the force of an elephant - the opinions of amateurs and the anti-copyright movement notwithstanding.

As such, AI Observability is a problem that's both under active research, and the basis for B2B companies.

https://censius.ai/wiki/ai-observability

https://towardsdatascience.com/what-is-ml-observability-29e8...

https://whylabs.ai/observability

https://arize.com/

Observability is great but it doesn’t give granular enough insights into what is actually happening.

Given a black box you can do two things: watch the black box for a while to see what it does, or take it apart to see how it works.

Observability is the former. Useful in many cases, just not here.

If you want to know what LLMs are actually doing, you’ll need the latter: looking at weight activations, for example, although with billions of parameters that’s infeasible.

Those companies are not solving the problem you are describing.
Probably why, like the article says, they're planning to add that

> In an attempt to address the issues with open-source licensing, GitHub plans to introduce a new Copilot feature that will “provide a reference for suggestions that resemble public code on GitHub so that you can make a more informed decision about whether and how to use that code,” including “providing attribution where appropriate.” GitHub also has a configurable filter to block suggestions matching public code.
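
For a rough idea of what a "matches public code" check could look like, here is a minimal sketch. It is not GitHub's implementation, which the article doesn't describe. The function names, the 5-token window, and the 0.8 threshold are all assumptions made up for illustration. It compares a suggestion against known public snippets by overlapping token windows, so whitespace differences don't hide a verbatim copy.

```python
import re

def tokens(code: str) -> list[str]:
    """Split code into identifier, number, and punctuation tokens; whitespace is dropped."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def shingles(code: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of all n-token windows in the code."""
    toks = tokens(code)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap(suggestion: str, public_code: str, n: int = 5) -> float:
    """Fraction of the suggestion's n-token windows that also occur in public_code."""
    s = shingles(suggestion, n)
    if not s:
        return 0.0  # suggestion too short to compare
    return len(s & shingles(public_code, n)) / len(s)

def should_block(suggestion: str, corpus: list[str],
                 threshold: float = 0.8, n: int = 5) -> bool:
    """Hypothetical filter: block if any public snippet covers most of the suggestion."""
    return any(overlap(suggestion, doc, n) >= threshold for doc in corpus)
```

Note that this only flags a match. It says nothing about which of many identical copies is the original, which is the attribution problem raised upthread.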