| > I assumed we were talking about logistics, not tech. Still uncertain what you mean - the logistics of creating something? Logistics as in transporting goods? Either way I think veggieroll's point on viability still stands. > Deepseek is more or less demonstrating that in real time. Maybe there's copyright data but I'd be surprised if it used anything close to 80 TB like competitors * GPT-4 is reported to have been trained on 13 trillion tokens total - which is counting two passes over a dataset of 6 trillion tokens[0] * DeepSeek-V3, the previous model that DeepSeek-R1 was fine-tuned from, is reported to have been pre-trained on a dataset of 14.8 trillion tokens[1] Can't find any licensing deals DeepSeek have made, so the vast majority of that will almost certainly be unlicensed data - possibly from CommonCrawl and shadow libraries. [0]: https://patmcguinness.substack.com/p/gpt-4-details-revealed [1]: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSee... > > > Let's not pretend these companies can't do this through normal channels. > > I'm not sure that there really has been a normal channel [...] > There isn't. Then, surely it's not just pretending? A while back, as a side project, I'd had a go at making a tool to describe photos for visually impaired users. I contacted Getty to see if I could license images for model training, and was told directly that they don't license images for machine learning. Particularly given that I'm not a massive company, I just don't think there really are any viable paths at the moment except for using web-scraped datasets. > So they'd need to do it the old fashioned way with agreements. I'm sceptical of whether even the largest companies would be able to get sufficient data for pre-training models like LLMs from only explicit licensing agreements. > I don't exactly pity their herculean effort. 
Those same companies spend decades suing individuals for much pettier uses and building those precedent up (some covered under free use). I feel you're conflating two groups: model developers that have previously been (on average) supportive of fair use, and media companies (such as the ones currently launching lawsuits against model training) that lobbied for stronger copyright law. Both are acting in self-interest, but I'd disagree with the idea that there was any significant switching of sides on the topic of copyright. > Content creators now need to take extra precautions so they aren't stolen from because they don't even bother trying to respect robots.txt. The major US players claim to respect robots.txt[2][3][4], as does CommonCrawl[5] which is what the smaller players are likely to use. You can verify that CommonCrawl respects robots.txt by downloading the dataset yourself and checking. If OpenAI/etc. are lying, it should be possible for essentially anyone hosting a website to prove it by showing access from one of the IPs they use for scraping[6]. (I say IPs rather than user-agent string because anyone can set their user-agent string to anything they want, and it's common for malicious/poorly-behaved actors to pretend to be a browser or a more common bot). [2]: https://platform.openai.com/docs/bots [3]: https://support.anthropic.com/en/articles/8896518-does-anthr... [4]: https://blog.google/technology/ai/an-update-on-web-publisher... [5]: https://commoncrawl.org/faq [6]: https://openai.com/gptbot.json > Was all that velocity worth it? Who benefitted from this outside of a few billionaires? We can't even say we beat China on this. There's been a large range of beneficial uses for machine learning: language translation, video transcription, material/product defect detection, weather forecasting/early warning systems, OCR, spam filtering, protein folding, tumor segmentation, drug discovery and interaction prediction, etc. 
I think this mainly comes back to my point that large-scale pretraining is not just for LLM chatbots. If you want to see the full impact, you can't just have tunnel vision on the currently most-hyped product of the largest companies. > Humans inherit their data and slowly structure around that. Maybe if AI models collaborated together as humanity did, I would sympathize more with this argument. Machine learning in general (not "OpenAI") is a fairly open and collaborative field. Source code for training/testing is commonly available to use and improve; papers documenting algorithms, benchmarks, and experiments are freely available; arXiv (Cornell University's open-access preprint repository) is the place for AI papers, as opposed to paywalled journals; and it's very common to fine-tune someone's existing pretrained model to perform a new task (transfer learning) as opposed to training from scratch. I'd attribute a lot of the field's success to building off each other's work in this way. In other industries, new concepts like transformers or low-rank adaptation might still be languishing under a patent instead of having been integrated and improved on by countless other groups. > AI can evolve organically but it instead devolved into a thieve's den. Unclear what you mean by organically - evolution still needs data. |