| HN Mirror


	by brianjking 1219 days ago
	This is great, but similar to GPT4All, it will likely be deemed unusable for any commercial or otherwise "legitimate" use cases since it's trained on OpenAI completions from sharegpt.com. https://github.com/nomic-ai/gpt4all

6 comments

qgin 1219 days ago

I don’t see how OpenAI is in any position to make a stand against scraping.

echelon 1219 days ago

And where did OpenAI get their data from?

I'm somewhat minded to hedge bets that all AI outputs will be ruled non-copyrightable.

I know of a research team training against my TTS service. They've got five robots ingesting data now.

It's all ouroboros.

sebzim4500 1219 days ago

It's made by few enough people that it is possible none of them have ever accepted the OpenAI terms of service, in which case there is no problem.

The bigger problem is they used the weights from Meta, which are possibly copyrightable in the US and likely copyrightable in the EU.

shp0ngle 1219 days ago

I don’t understand… GPT is itself trained on copyrighted text?

jerojero 1219 days ago

Yes, it probably is.

However, ChatGPT's terms of service are explicit about disallowing people from using code generation from the model to train other models. Now, how enforceable that might be... that probably remains a matter for the courts.

But doing what these people have done is breaking the terms of service of the application.

Anunayj 1218 days ago

OpenAI pursuing that would likely end up setting a precedent for its own demise. I'd like to see them try.

visarga 1219 days ago

I honestly think it's the better way to deal with this problem - nothing the model generates should be copyrightable. You can use model outputs for anything unless the model replicates training data verbatim. This leaves a path open for AI skills to trickle down to open source models. It's a pity we can't copyright model outputs or the models themselves (also a result of a mechanistic process), but better in the long run.

We should not protect ideas from replication, only the expression should be copyrightable. Using data from another model extracts the ideas without the expression of the original training set, exactly following the idea/expression rule.

Zuiii 1219 days ago

Unless you sourced the training dataset and trained it on that model, the US copyright office disagrees: https://www.federalregister.gov/documents/2023/03/16/2023-05...
