Hacker News
by TehCorwiz 9 days ago
Because AI companies basically took everything we ever wrote, drew, recorded, posted, or thought and turned it into a product with the power to lie, propagandize, and manipulate the public with zero oversight. Walmart is a parasite using welfare to subsidize their operations but they didn't tell a judge that they were immune to copyright because they stole just so much damn information.
3 comments

Yes but all the AI companies took all the public data, so when you pay for an AI model you are paying for the marginal service of building a model off that data, not for the data itself. What we should do is ensure that the data is available to more people to train AI models... but sadly this doesn't seem to be happening. Instead AI companies that were first-movers got to train off public data, and as the companies and businesses that own this data get wise they're going to start charging people to train off the data. This will make it much more difficult for anyone to train a model in the future as it will become expensive, and the companies that did happen to already train off public data will get a bit of incumbent's advantage.
I don't really buy this argument. When you buy a physical product, you are paying for the entire product lifecycle, not just the marginal aspect of retail distribution. This is the same thing. The marginal inference has to come FROM somewhere. It doesn't just appear out of nowhere.
This is the same argument people were making about how stealing music was the same as stealing a physical product.
AI companies took public, private, and copyrighted data. Your position is that because these big companies stole so much we should let them get away with it by devaluing it further so everyone can ignore intellectual property law.
They already "stole" it. They aren't giving it back and they've established their valuations based off of that. If they start paying now, it's simply going to be impossible for any more upstarts to do this or even release open-weight models because everyone with data will become rent seekers. Imagine if they started off with rent-seekers, we'd simply not have the benefit of these models at all at this point.
> when you pay for an AI model you are paying for the marginal service of building a model off that data, not for the data itself

Well no, you're also paying them for having done the work to "acquire" that data. That acquisition arguably amounts to the greatest theft in history.

If I understand correctly, the initial models were built by scraping public data off the internet. They now probably pay for access, especially to companies that hold the data. Even in the latter case, the content creators probably don't see anything because they signed away their rights to whatever free service they uploaded their work or comments to. In Hacker News' case, I'm sure some bot is scraping my prose right now and training something with it, which I'm totally fine with, because trying to rent-seek my 0.00001 cents of value in this post is not worth the detriment to AI advancement.
So if they only use public domain data, and data from the Chinese and Europeans, do you still feel entitled to their valuations?

Because I hate to break it to you, they could have zero drop in quality by just not incorporating US data...

If we're talking about copyright, why are we somehow entitled to profits derived from stealing Taylor Swift's IP? Why do you get a cut of AI derivatives and not get half of her wealth, too, directly?

Microsoft has trained models entirely on synthetic and public data with SotA results.

> Because I hate to break it to you, they could have zero drop in quality by just not incorporating US data...

This is so false and unsupportable it's comical. The same goes the other way, if you claim they would lose no value by only incorporating American data.

“Took?” AI companies aren’t removing the information from the public domain. What happened to “information wants to be free?”
I interpreted it to mean people feel as though they didn’t consent to having their information trained on, because for many folks, they published articles, open source projects, etc. assuming that they were only helping other people. It’s quite a shock to see megacorps use such data to create machines which threaten the livelihoods of the original authors themselves.

Also, much of the data used to train LLMs are not strictly public domain. For example, copyrighted books and source code with attribution-requiring licenses feature heavily in many corpuses. There are still pending lawsuits against the labs here, yet they continue to push forward. It’s no surprise that there is popular demand for redistribution.

> What happened to “information wants to be free?”

It was 1 part of an observation of opposed forces. “On the one hand information wants to be expensive, because it's so valuable. The right information in the right place just changes your life. On the other hand, information wants to be free, because the cost of getting it out is getting lower and lower all the time. So you have these two fighting against each other.”[1]

Some people removed the context and used it to say most information should be available to all. LLMs are information.

You thought this question proved what?

[1] https://sb.longnow.org/SB_homepage/Info_free_story.html

Individuals have faced federal charges and served prison time for reselling copyrighted content. I don't see the same happening to AI execs.
Are you proposing the Aaron Swartz treatment by the government for Zuckerberg, Altman, and Amodei?
Yes. Took. As in: without permission. Didn't ask beforehand, didn't provide a way to opt out (although that would also be problematic), didn't ask for volunteers. Took.
The word you're looking for is "copied".

Don't fall for the great lie of intellectual "property".

If I can go to jail over it then they should too. Let's not judge them by some imaginary ideal world while judging individuals by the present crushing reality.
But you cannot. At least not in the US. Basically, you cannot go to jail for it almost anywhere.