| > I said most other cases don't use millions/billions of works, you're saying that yes they do? I assumed we were talking about logistics, not tech. I'm sure it will be technically possibly to use less training data overtime (Deepseek is more or less demonstrating that in real time. Maybe there's copyright data but I'd be surprised if it used anything close to 80 TB like competittorz). I know hindsight is 20/20, but I always felt the earlier approaches were absurdly brute forced. >I'm not sure that there really has been a normal channel for licencing at the scale of "almost everything on the public Internet" There isn't. So they'd need to do it the old fashioned way with agreements . Or make some incentive model that has media submit their works with that understanding of training. Or any number of marketing ideas. I don't exactly pity their herculean effort. Those same companies spend decades suing individuals for much pettier uses and building those precedent up (some covered under free use). >and large archives of torrented data without lawsuits was one of the factors contributing to their fast success relative to western counterparts. And now they're being slowed down. If not litigsted out of the market. Public trust in AI is falling. The lack of oversight into hallucinations may have even cost a few lives. Content creators now need to take extra precautions so they aren't stolen from because they don't even bother trying to respect robots.txt. Even a few posts here on HN note how the scraping is so rampant that it can spike their hosting costs on websites (so now we need more capthas. And I hate myself for uttering such a sentence). Was all that velocity worth it? Who benefitted from this outside of a few billipnaires? We can't even say we beat China on this. >I don't think this is necessarily a given - humans evolved on ~4 billion years worth of data, after all Humans inherit their data and slowly structure around that. Maybe if AI models collaborated together as humanity did, I would sympathize more with this argument. We both know it's instead a rat race and the goal isn't survival and passing on knowledge (and genes) to the next generation. AI can evolve organically but it instead devolved into a thieve's den. I take the approach more like Bell's Spacecraft paradox. If they started gaining data ethically, by the time they gather a decent chunk they probably would have already optimized a model that needs less data. It'd be slower but not actually much slower I'm the long run. But they aren't exactly trying to go for quality here. >I believe lawsuits launched by or fuss kicked up by model developers will typically be on a contract basis (i.e "you agreed to our ToS then broke it") rather than a copyright basis. I suppose we'll see. Too early to tell. This lawsuit will definitely be precedent in other ongoing cases, but others may shift to a copyright infringement case anyway. Unlike other llms there was some human tailoring going on here, so it's not fully comparable to something like the NYT case. |
Still uncertain what you mean - the logistics of creating something? Logistics as in transporting goods? Either way I think veggieroll's point on viability still stands.
> Deepseek is more or less demonstrating that in real time. Maybe there's copyright data but I'd be surprised if it used anything close to 80 TB like competittorz
* GPT-4 is reported to have been trained on 13 trillion tokens total - which is counting two passes over a dataset of 6 trillion tokens[0]
* DeepSeek-V3, the previous model that DeepSeek-R1 was fine-tuned from, is reported to have been pre-trained on a dataset of 14.8 trillion tokens[1]
Can't find any licensing deals DeepSeek have made, so vast majority of that will almost certainly be unlicensed data - possibly from CommonCrawl and shadow libraries.
[0]: https://patmcguinness.substack.com/p/gpt-4-details-revealed
[1]: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSee...
> > > Let's not pretend these companies can't do this through normal channels.
> > I'm not sure that there really has been a normal channel [...]
> There isn't.
Then, surely it's not just pretending?
A while back, as a side project, I'd had a go at making a tool to describe photos for visually impaired users. I contacted Getty to see if I could license images for model training, and was told directly that they don't license images for machine learning. Particuarly given that I'm not massive company, I just don't think there really are any viable paths at the moment except for using web-scraped datasets.
> So they'd need to do it the old fashioned way with agreements .
I'm sceptical of whether even the largest companies would be able to get sufficient data for pre-training models like LLMs from only explicit licensing agreements.
> I don't exactly pity their herculean effort. Those same companies spend decades suing individuals for much pettier uses and building those precedent up (some covered under free use).
I feel you're conflating two groups: model developers that have previously been (on average) supportive of fair-use, and media companies (such as the ones currently launching lawsuits against model training) that lobbied for stronger copyright law. Both are acting in self-interest, but I'd disagree with the idea that there was any significant switching of sides on the topic of copyright.
> Content creators now need to take extra precautions so they aren't stolen from because they don't even bother trying to respect robots.txt.
The major US players claim to respect robots.txt[2][3][4], as does CommonCrawl[5] which is what the smaller players are likely to use.
You can verify that CommonCrawl respects robots.txt by downloading it yourself and checking.
If OpenAI/etc. are lying, it should be possible for essentially anyone hosting a website to prove it by showing access from one of the IPs they use for scraping[6]. (I say IPs rather than useragent string because anyone can set their useragent string to anything they want, and it's common for malicious/poorly-behaved actors to pretend to be a browser or more common bot).
[2]: https://platform.openai.com/docs/bots
[3]: https://support.anthropic.com/en/articles/8896518-does-anthr...
[4]: https://blog.google/technology/ai/an-update-on-web-publisher...
[5]: https://commoncrawl.org/faq
[6]: https://openai.com/gptbot.json
> Was all that velocity worth it? Who benefitted from this outside of a few billipnaires? We can't even say we beat China on this.
There's been a large range of beneficial uses for machine learning: language translation, video transcription, material/product defect detection, weather forecasting/early warning systems, OCR, spam filtering, protein folding, tumor segmentation, drug discovery and interaction prediction, etc.
I think this mainly comes back to my point that large-scale pretraining is not just for LLM chatbots. If you want to see the full impact, you can't just have tunnel-vision on the most currently-hyped product of the largest companies.
> Humans inherit their data and slowly structure around that. Maybe if AI models collaborated together as humanity did, I would sympathize more with this argument.
Machine learning in general (not "OpenAI") is a fairly open and collaborative field. Source code for training/testing is commonly available to use and improve; papers documenting algorithms, benchmarks, and experiments are freely available; arXiv (Cornell University's open-access preprint repository) is the place for AI papers, opposed to paywalled journals; and it's very common to fine-tune someone's existing pretrained model to perform a new task (transfer learning) opposed to training from scratch.
I'd attribute a lot of the field's success to building off each others' work in this way. In other industries, new concepts like transformers or low-rank-adaptation might still be languishing under a patent instead of having been integrated and improved on by countless other groups.
> AI can evolve organically but it instead devolved into a thieve's den.
Unclear what you mean by organically - evolution still needs data.