Hacker News
by chrischen 17 days ago
Yes but all the AI companies took all the public data, so when you pay for an AI model you are paying for the marginal service of building a model off that data, not for the data itself. What we should do is ensure that the data is available to more people to train AI models... but sadly this doesn't seem to be happening. Instead AI companies that were first-movers got to train off public data, and as the companies and businesses that own this data get wise they're going to start charging people to train off the data. This will make it much more difficult for anyone to train a model in the future as it will become expensive, and the companies that did happen to already train off public data will get a bit of incumbent's advantage.
3 comments

I don't really buy this argument. When you buy a physical product, you are paying for the entire product lifecycle, not just the marginal aspect of retail distribution. This is the same thing. The marginal inference has to come FROM somewhere. It doesn't just appear out of nowhere.
This is the same argument people were making about how stealing music was the same as stealing a physical product.
AI companies took public, private, and copyrighted data. Your position is that because these big companies stole so much we should let them get away with it by devaluing it further so everyone can ignore intellectual property law.
They already "stole" it. They aren't giving it back and they've established their valuations based off of that. If they start paying now, it's simply going to be impossible for any more upstarts to do this or even release open-weight models because everyone with data will become rent seekers. Imagine if they had started off with rent-seekers: we'd simply not have the benefit of these models at all at this point.
> when you pay for an AI model you are paying for the marginal service of building a model off that data, not for the data itself

Well no, you're also paying them for having done the work to "acquire" that data. That acquisition arguably amounts to the greatest theft in history.

If I understand correctly, the initial models were built by scraping public data off the internet. They now probably pay for access, especially to companies that hold the data. Even in the latter case, the content creators probably don't see anything because they signed away their rights by using whatever free service they uploaded their work / comment to. In Hacker News' case I'm sure some bot is scraping my prose right now and training something with it, which I'm totally fine with because the act of trying to rent seek my 0.00001 cent of value in this post is not worth the detriment to AI advancement.