| HN Mirror



	by Lockyy 1117 days ago
	It seems relatively straightforward (famous last words) to assess whether actual copyrighted text is embedded within the network. If you can prompt output that includes verbatim extracts when the copyright-avoidance post-processing is disabled, then you know that it has been consumed. Of course, whether that was purposeful or an inadvertent part of the larger training set would not be determined, but you would know that the text is in there.
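
A minimal sketch of what such a check could look like, assuming you already have the model's raw output as a string: find the longest run of consecutive words it shares with the source text. The function name `longest_verbatim_run` and the sample strings are hypothetical, and a long shared run is only evidence, not proof, of memorization.

```python
from difflib import SequenceMatcher

def longest_verbatim_run(source_text: str, model_output: str) -> list[str]:
    """Return the longest run of consecutive words shared by both texts."""
    src = source_text.lower().split()
    out = model_output.lower().split()
    # autojunk=False so frequent words are not discarded as "junk" in long texts
    matcher = SequenceMatcher(None, src, out, autojunk=False)
    match = matcher.find_longest_match(0, len(src), 0, len(out))
    return src[match.a : match.a + match.size]

# Illustrative inputs only; a real test would use the copyrighted text
# and many sampled completions.
source = "It was the best of times, it was the worst of times"
output = "He wrote that it was the best of times, or so I recall"
run = longest_verbatim_run(source, output)
print(len(run), " ".join(run))
```

In practice you would set a threshold on the run length, since short shared phrases turn up by chance in any two English texts.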

2 comments

Aerroon 1117 days ago

>If you can prompt output that includes verbatim extracts

If I create a program that picks random words from a dictionary and I end up with a seed that generates that text verbatim, then does that mean my program contains the copyrighted text?

You might be able to craft an intricate prompt that just happens to recreate that copyrighted text. Run it enough times until you get it verbatim and done.


constantcrying 1117 days ago

>If I create a program that picks random words from a dictionary

And LLMs do that, except prior to picking the word, they do complex statistics to figure out the probability distributions of those words.

Almost certainly some combination of input and RNG seed will produce any "small" combination of words.
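
To put rough numbers on the difference, here is a back-of-the-envelope sketch. It assumes each word is drawn independently, a dictionary of 50,000 words for the random picker, and a model that gives the "right" next word probability 0.9 for a memorized passage. All three figures are made up for illustration.

```python
import math

def log10_expected_tries(per_word_prob: float, n_words: int) -> float:
    """log10 of the expected number of independent attempts before an exact n-word match."""
    # P(exact match) = p ** n, and the expected number of attempts is 1 / P
    return -n_words * math.log10(per_word_prob)

VOCAB_SIZE = 50_000   # assumed dictionary size for the random picker
PASSAGE_LEN = 50      # assumed passage length in words

uniform = log10_expected_tries(1 / VOCAB_SIZE, PASSAGE_LEN)
peaked = log10_expected_tries(0.9, PASSAGE_LEN)

print(f"uniform picker: about 10^{uniform:.0f} tries")
print(f"peaked model:   about 10^{peaked:.1f} tries")
```

Under these assumptions the dictionary picker needs on the order of 10^235 attempts for a 50-word passage, while the peaked distribution needs a couple of hundred. That gap is why a verbatim hit from an LLM says something about its training data and a hit from a random picker does not.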


constantcrying 1117 days ago

> you can prompt output that includes verbatim extracts when the copyright avoidance post-processing is disabled then you know that it has been consumed.

No, you know only that that part was likely consumed. You would need to show that it will generate arbitrary passages from the text.

And LLMs are inherently random, so proof that this happens is very difficult to obtain, and showing that it is actual output is nearly impossible, especially if you just have API access and can't use the model directly (e.g. fix the RNG seed).
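
As a toy illustration of why fixing the seed matters, here is a sketch with a made-up five-word distribution standing in for a model. This is not any real LLM API; `VOCAB`, `WEIGHTS`, and `sample_words` are hypothetical names.

```python
import random

# Toy stand-in for a model's next-word distribution (assumed values)
VOCAB = ["the", "cat", "sat", "on", "mat"]
WEIGHTS = [0.4, 0.2, 0.2, 0.1, 0.1]

def sample_words(seed: int, n: int = 8) -> list[str]:
    """Draw n words from the toy distribution using a dedicated, seeded RNG."""
    rng = random.Random(seed)
    return rng.choices(VOCAB, weights=WEIGHTS, k=n)

first = sample_words(seed=42)
second = sample_words(seed=42)
print(first == second)  # same seed, same distribution: identical output
```

With the seed and the weights in hand, anyone can rerun the sample and get the same words. With only API access you usually control neither, so a claimed output is hard for a third party to reproduce.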

If you have that, you can debate whether it is or isn't fair use.


Lockyy 1116 days ago

Arbitrary passages is what I meant by "verbatim extracts."
