| HN Mirror



	by chpatrick 1214 days ago
	I'm also worried that courts will decide that model weights are copyrightable and the open source free-for-all will be over.


nwienert 1214 days ago

If models can’t prove they are fully free of copyrighted data I don’t think they’ll have a leg to stand on there.


nl 1214 days ago

This is clearly not a given. The search engine cases are good, decided case law pointing in the opposite direction.


Nullabillity 1214 days ago

Search engines aren't a replacement for the original data, they're a way to direct traffic towards it.


nl 1214 days ago

The business model doesn't really change the IP considerations though.

(Additionally, newer LLMs like Perplexity.AI's correctly cite content sources, so that is even more similar to search engines)


nwienert 1214 days ago

These models will readily generate near identical outputs to copyrighted data, at length. This is not comparable to search.


nl 1214 days ago

I'd invite you to read "Foundation Models and Fair Use"[1], a paper written as a collaboration between Stanford's law school and computer science department.

It talks at length about this specific problem and mitigation techniques for it:

Existing foundation models are trained on copyrighted material. Deploying these models can pose both legal and ethical risks when data creators fail to receive appropriate attribution or compensation. In the United States and several other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine. However, there is a caveat: If the model produces output that is similar to copyrighted data, particularly in scenarios that affect the market of that data, fair use may no longer apply to the output of the model. In this work, we emphasize that fair use is not guaranteed, and additional work may be necessary to keep model development and deployment squarely in the realm of fair use. First, we survey the potential risks of developing and deploying foundation models based on copyrighted content. We review relevant U.S. case law, drawing parallels to existing and potential applications for generating text, source code, and visual art. Experiments confirm that popular foundation models can generate content considerably similar to copyrighted material. Second, we discuss technical mitigations that can help foundation models stay in line with fair use. We argue that more research is needed to align mitigation strategies with the current state of the law.

[1] https://arxiv.org/abs/2303.15715


yuuuuyu 1214 days ago

If you are holding copyright to something, it will be on you to prove it's in there.


manojlds 1214 days ago

I am obviously clueless, but if it becomes a court case, can't one demand to know what the training data is?


arthurcolle 1214 days ago

Based on what case law? In what jurisdiction?


nwienert 1214 days ago

It's very easy to show the models generating near-identical data to copyrighted data, which is enough to get courts to force them to allow discovery.


yuuuuyu 1213 days ago

That this hasn't happened all over the place yet is evidence against it being as easy as you make it sound.
