Hacker News
by dcwca 655 days ago
It is totally legal to train on this stuff, but illegal to reproduce copyrighted works. Interestingly, Google's business model could have been criticized the same way. They construct a big index of copyrighted works, reproduce them, and monetize it.

They don't generate new content that convinces people it came from the sources they trained on.

The entire business model is "we trained on their stuff, pay us, not them." No way that's fair use.

I mean, if I go to the library and read books, and then get a job where I use that knowledge, the company pays me. Not the authors of the books I read.

So I don't see how their business model is any different from literally every person who learns things and then sells their ability to apply that knowledge.

The people who made the product possible get nothing; that is the difference. The library paid for a copy of the book, and so did millions of others.

In the example you gave, it would be equivalent to you getting a job, working hard to produce something, and getting nothing in return.

What are you expecting the people who write the books to get?

Do you agree that if an author sold 43,958 copies, then it's fine for OpenAI to purchase one, so that the author sold 43,959? But also fine for OpenAI to ingest scanned used copies that are loaned to it? The same way it's fine for me to read a friend's book, or all of a friend's books, that they loan me, and the author doesn't get anything additional? The same way it's fine for me to go to the library and the author doesn't get paid anything extra?

Or are you trying to invent some new principle where OpenAI has to pay some new ongoing fee? And if so, on what basis?

(And no, my example still stands entirely. It's from the perspective of somebody who learned from books, and they are getting paid, the same way people pay OpenAI to use ChatGPT. It's not from the perspective of authors, because again -- they make no additional money when somebody goes to the library to read their book that the library already purchased.)

It's not about what the "author should get for their book". It's that OpenAI benefits unfairly from using everyone's work to make nearly endless money and lobby for regulatory capture.

The author should get access to the model and the weights; it should all be open source because it partly contains their work. Just like how OpenAI could outright buy a copy of the author's work.

Basically, I think this is where knowledge and money are coming into an unresolvable conflict: who owns the ideas? Who owns information?

OpenAI seem to be trying to have a monopoly on information, and while they seem to be failing (thankfully), it's really where the issue lies for me.

Where are you getting this "nearly endless money" and "lobby for regulatory capture" and "monopoly on information"?

OpenAI competes with Google competes with a bunch of other companies, and surely this is only the beginning of a ton of competition as better and better models are developed. There's no "nearly endless money" when there's competition and GPU training costs a fortune.

The idea that all models should be open source to everyone or all content creators doesn't make any more sense than the idea that all the work I do should be open sourced to the authors of every book I've read, and every teacher I've ever had.

You ask two questions that have clear answers already:

> who owns the ideas?

Nobody. Legally speaking there's no such thing as ownership of ideas, except in the narrow case of patents (and if you consider trademarks to be ideas).

> who owns information?

You can copyright a particular, exact expression of information. The author of a book owns its text; the studio behind a movie owns the image in each frame.

But once you leave behind an exact expression of information, you're back in the realm of ideas, and there's no such thing as ownership of ideas. Which is why as long as ChatGPT and other models repeat ideas but not paragraphs of exact copyrighted wording, there's no legal issue. Because they're doing the same exact thing every human being does every day.

Writing a book about C isn't the same thing as "write me a mail server in C".

The right analogy here is that you read their book about C and write another one in a style close enough to theirs that your book can stand in for it, but you sell it for pennies.

The difference is: LLMs are not humans with human needs and human rights. Unless these for-profit AI companies can ensure that they fairly compensate the sources of their training data, they’re using IP they have no right to use in order to replace the work of living, breathing humans who need income in order to live in houses and eat food. Why would you place the potential profits of the few (and the massive environmental impact of using LLMs) over the needs and rights of your neighbors and humans all around the world?