| > Yes, they do
To clarify: veggieroll said training models wouldn't be viable, you said it'd just require licensing like everyone else already manages, I said most other cases don't use millions/billions of works, you're saying that yes they do? I feel like there must be a misunderstanding here, because that doesn't make much sense to me. Even for making a movie, which I think would be the most onerous of traditional cases, the number of works you'd license would likely be in the dozens (couple of pop songs, some stock images, etc.) - not billions.
> Let's not pretend these companies can't do this through normal channels
I'm not sure that there really has been a normal channel for licencing at the scale of "almost everything on the public Internet". A compulsory licensing scheme, like the US has for cover songs, could make it feasible to pay into a pot - but again I'd really hope for model training to remain accessible to smaller players as opposed to just "meh, OpenAI has billions".
> but it's pretty clear from Deepseek that you don't need 82 TB of data to be effective.
As far as I'm aware, DeepSeek is not a low-data model. In fact, given China's more lax approach to copyright, I would not be surprised if the ability to freely pass around shadow libraries and large archives of torrented data without lawsuits was one of the factors contributing to their fast success relative to western counterparts.
> If we need that much data, there are clearly optimizations to be made.
I don't think this is necessarily a given - humans evolved on ~4 billion years worth of data, after all.
> Yet they will sue anytime their data is scraped or otherwise not making the money. Maybe they didn't put trillions into lobbying like others, but they definitely have their fair share of using copyright.
I believe lawsuits launched by or fuss kicked up by model developers will typically be on a contract basis (i.e. "you agreed to our ToS then broke it") rather than a copyright basis.
Again not to say these tech companies are acting in any way except their own self-interest, just that they've generally been more pro-fair-use than pro-strict-copyright on average to my knowledge. |
I assumed we were talking about logistics, not tech. I'm sure it will be technically possible to use less training data over time (DeepSeek is more or less demonstrating that in real time. Maybe there's copyrighted data, but I'd be surprised if it used anything close to 80 TB like competitors).
I know hindsight is 20/20, but I always felt the earlier approaches were absurdly brute forced.
>I'm not sure that there really has been a normal channel for licencing at the scale of "almost everything on the public Internet"
There isn't. So they'd need to do it the old-fashioned way, with agreements. Or make some incentive model that has media submit their works with the understanding that they'll be used for training. Or any number of marketing ideas.
I don't exactly pity their herculean effort. Those same companies spent decades suing individuals for much pettier uses and building that precedent up (some of those uses covered under fair use).
>and large archives of torrented data without lawsuits was one of the factors contributing to their fast success relative to western counterparts.
And now they're being slowed down, if not litigated out of the market. Public trust in AI is falling. The lack of oversight into hallucinations may have even cost a few lives. Content creators now need to take extra precautions so they aren't stolen from, because the scrapers don't even bother trying to respect robots.txt. Even a few posts here on HN note how the scraping is so rampant that it can spike their hosting costs on websites (so now we need more captchas. And I hate myself for uttering such a sentence).
Was all that velocity worth it? Who benefitted from this outside of a few billionaires? We can't even say we beat China on this.
>I don't think this is necessarily a given - humans evolved on ~4 billion years worth of data, after all
Humans inherit their data and slowly structure around that. Maybe if AI models collaborated together as humanity did, I would sympathize more with this argument.
We both know it's instead a rat race, and the goal isn't survival and passing on knowledge (and genes) to the next generation. AI could evolve organically, but it instead devolved into a thieves' den.
I take the approach more like Bell's Spacecraft paradox. If they started gathering data ethically, by the time they gathered a decent chunk they probably would have already optimized a model that needs less data. It'd be slower, but not actually much slower in the long run. But they aren't exactly trying to go for quality here.
>I believe lawsuits launched by or fuss kicked up by model developers will typically be on a contract basis (i.e. "you agreed to our ToS then broke it") rather than a copyright basis.
I suppose we'll see. Too early to tell. This lawsuit will definitely set precedent for other ongoing cases, but others may shift to a copyright infringement case anyway. Unlike other LLMs, there was some human tailoring going on here, so it's not fully comparable to something like the NYT case.