| >Most other scenarios don't use millions/billions of works - that's the part which puts viability in question. Yes, they do. We have acquisitions in the billions these days and exclusivity deals in the hundreds of millions. Let's not pretend these companies can't do this through normal channels. They just wanna steal because they think they can get away with it. >I'd like training models to also remain accessible to open-source developers, academic researchers, and smaller businesses. Same. But such models still need to be ethically sourced. Maybe there's not enough royalty free content to compete with OpenAI, but it's pretty clear from Deepseek that you don't need 82 TB of data to be effective. If we need that much data, there are clearly optimizations to be made. >I think that self-interest has put them in a position of supporting fair use and copyright safe harbors, Yet they will sue anytime their data is scraped or otherwise not making the money. Maybe they didn't put trillions into lobbying like others, but they definitely have their fair share of using copyright. Microsoft won a lawsuit against web scraping via LinkedIn less than a year before OpenAI fell into legal troubles over scraping the entire internet. |
To clarify: veggieroll said training models wouldn't be viable, you said it'd just require licensing like everyone else already manages, I said most other cases don't use millions/billions of works, you're saying that yes they do?
I feel like there must be a misunderstanding here, because that doesn't make much sense to me. Even for making a movie, which I think would be the most onerous of traditional cases, the number of works you'd license would likely be in the dozens (couple of pop songs, some stock images, etc.) - not billions.
> Let's not pretend these companies can't do this through normal channels
I'm not sure that there really has been a normal channel for licensing at the scale of "almost everything on the public Internet". A compulsory licensing scheme, like the US has for cover songs, could make it feasible to pay into a pot - but again I'd really hope for model training to remain accessible to smaller players, as opposed to just "meh, OpenAI has billions".
> but it's pretty clear from Deepseek that you don't need 82 TB of data to be effective.
As far as I'm aware, DeepSeek is not a low-data model. In fact, given China's more lax approach to copyright, I would not be surprised if the ability to freely pass around shadow libraries and large archives of torrented data without lawsuits was one of the factors contributing to their fast success relative to western counterparts.
> If we need that much data, there are clearly optimizations to be made.
I don't think this is necessarily a given - humans evolved on ~4 billion years' worth of data, after all.
> Yet they will sue anytime their data is scraped or otherwise not making the money. Maybe they didn't put trillions into lobbying like others, but they definitely have their fair share of using copyright.
I believe lawsuits launched by, or fuss kicked up by, model developers will typically be on a contract basis (i.e. "you agreed to our ToS then broke it") rather than a copyright basis. Again, that's not to say these tech companies are acting in any way except their own self-interest, just that, to my knowledge, they've generally been more pro-fair-use than pro-strict-copyright.