| > Sure it is. It just requires what every other copyright'd work needs: permission and stipulations from the copyright holder. Most other scenarios don't use millions/billions of works - that's the part which puts viability in question. > these are large scale businesses. I'd like training models to also remain accessible to open-source developers, academic researchers, and smaller businesses. Large-scale pretraining is common even for models that are not cutting-edge LLMs. > The only solace I take is that these conglomerates are paying a lot to take down the rules they made 30 years ago when they weren't the ones profiting from stealing As far as I'm aware, most of the lobbying in favor of stricter copyright has been done by Disney, Universal, Time Warner, RIAA, etc. Not to say that tech companies have a consistent moral stance beyond whatever's currently in their financial self-interest, but I think that self-interest has put them in a position of supporting fair use and copyright safe harbors, opposing link tax, etc. more often than the the other way around - with cases like Authors Guild v. Google being a significant win for fair use. |
Yes, they do. We have acquisitions in the billions these days and exclusivity deals in the hundreds of millions. Let's not pretend these companies can't do this through normal channels. They just wanna steal because they think they can get away from it.
>I'd like training models to also remain accessible to open-source developers, academic researchers, and smaller businesses.
Same. But such models still need to be ethically sourced. Maybe there's not enough royalty free content to compete with OpenAI, but it's pretty clear from Deepseek that you don't need 82 TB of data to be effective. If we need that much data, there are clearly optimizations to be made.
>I think that self-interest has put them in a position of supporting fair use and copyright safe harbors,
Yet they will sue anytime their data is scraped or otherwise not making the money. Maybe they didn't put trillions into lobbying like others, but they definitely have their fair share od using copyright. Microsoft won a lawsuit against web scraping via LinkedIn less than a year before OpenAI fell into legal troubles over scraping the entire internet.