Hacker News
by brianjking 1173 days ago
This is great, but similar to GPT4All, it will likely be deemed unusable for any commercial or otherwise "legitimate" use cases since it's trained on OpenAI completions from sharegpt.com.

https://github.com/nomic-ai/gpt4all

6 comments

I don’t see how OpenAI is in any position to make a stand against scraping.
And where did OpenAI get their data from?

I'm somewhat minded to hedge bets that all AI outputs will be ruled non-copyrightable.

I know of a research team training against my TTS service. They've got five robots ingesting data now.

It's all ouroboros.

It's made by few enough people that it is possible none of them have ever accepted the OpenAI terms of service, in which case there is no problem.

The bigger problem is they used the weights from Meta, which are possibly copyrightable in the US and likely copyrightable in the EU.

I don’t understand… GPT is itself trained on copyrighted text?
Yes, it probably is.

However, ChatGPT's terms of service are explicit about disallowing people from using code generation from the model to train other models. Now, how enforceable that might be... that probably remains a matter for the courts.

But doing what these people have done is breaking the terms of service of the application.

OpenAI pursuing that would likely end up setting a precedent for its own demise; I'd like to see them try.
I honestly think it's the better way to deal with this problem - nothing the model generates should be copyrightable. You can use model outputs for anything unless the model replicates training data verbatim. This leaves a path open for AI skills to trickle down to open source models. It's a pity we can't copyright model outputs or the models themselves (also a result of a mechanistic process), but better in the long run.

We should not protect ideas from replication, only the expression should be copyrightable. Using data from another model extracts the ideas without the expression of the original training set, exactly following the idea/expression rule.

Unless you sourced the training dataset and trained it on that model, the US Copyright Office disagrees: https://www.federalregister.gov/documents/2023/03/16/2023-05...