| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Ukv 495 days ago

> I assumed we were talking about logistics, not tech

Still uncertain what you mean - the logistics of creating something? Logistics as in transporting goods? Either way I think veggieroll's point on viability still stands.

> Deepseek is more or less demonstrating that in real time. Maybe there's copyright data but I'd be surprised if it used anything close to 80 TB like competittorz

* GPT-4 is reported to have been trained on 13 trillion tokens total - which is counting two passes over a dataset of 6 trillion tokens[0]

* DeepSeek-V3, the previous model that DeepSeek-R1 was fine-tuned from, is reported to have been pre-trained on a dataset of 14.8 trillion tokens[1]

Can't find any licensing deals DeepSeek have made, so vast majority of that will almost certainly be unlicensed data - possibly from CommonCrawl and shadow libraries.

[0]: https://patmcguinness.substack.com/p/gpt-4-details-revealed

[1]: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSee...

> > > Let's not pretend these companies can't do this through normal channels.

> > I'm not sure that there really has been a normal channel [...]

> There isn't.

Then, surely it's not just pretending?

A while back, as a side project, I'd had a go at making a tool to describe photos for visually impaired users. I contacted Getty to see if I could license images for model training, and was told directly that they don't license images for machine learning. Particuarly given that I'm not massive company, I just don't think there really are any viable paths at the moment except for using web-scraped datasets.

> So they'd need to do it the old fashioned way with agreements .

I'm sceptical of whether even the largest companies would be able to get sufficient data for pre-training models like LLMs from only explicit licensing agreements.

> I don't exactly pity their herculean effort. Those same companies spend decades suing individuals for much pettier uses and building those precedent up (some covered under free use).

I feel you're conflating two groups: model developers that have previously been (on average) supportive of fair-use, and media companies (such as the ones currently launching lawsuits against model training) that lobbied for stronger copyright law. Both are acting in self-interest, but I'd disagree with the idea that there was any significant switching of sides on the topic of copyright.

> Content creators now need to take extra precautions so they aren't stolen from because they don't even bother trying to respect robots.txt.

The major US players claim to respect robots.txt[2][3][4], as does CommonCrawl[5] which is what the smaller players are likely to use.

You can verify that CommonCrawl respects robots.txt by downloading it yourself and checking.

If OpenAI/etc. are lying, it should be possible for essentially anyone hosting a website to prove it by showing access from one of the IPs they use for scraping[6]. (I say IPs rather than useragent string because anyone can set their useragent string to anything they want, and it's common for malicious/poorly-behaved actors to pretend to be a browser or more common bot).

[2]: https://platform.openai.com/docs/bots

[3]: https://support.anthropic.com/en/articles/8896518-does-anthr...

[4]: https://blog.google/technology/ai/an-update-on-web-publisher...

[5]: https://commoncrawl.org/faq

[6]: https://openai.com/gptbot.json

> Was all that velocity worth it? Who benefitted from this outside of a few billipnaires? We can't even say we beat China on this.

There's been a large range of beneficial uses for machine learning: language translation, video transcription, material/product defect detection, weather forecasting/early warning systems, OCR, spam filtering, protein folding, tumor segmentation, drug discovery and interaction prediction, etc.

I think this mainly comes back to my point that large-scale pretraining is not just for LLM chatbots. If you want to see the full impact, you can't just have tunnel-vision on the most currently-hyped product of the largest companies.

> Humans inherit their data and slowly structure around that. Maybe if AI models collaborated together as humanity did, I would sympathize more with this argument.

Machine learning in general (not "OpenAI") is a fairly open and collaborative field. Source code for training/testing is commonly available to use and improve; papers documenting algorithms, benchmarks, and experiments are freely available; arXiv (Cornell University's open-access preprint repository) is the place for AI papers, opposed to paywalled journals; and it's very common to fine-tune someone's existing pretrained model to perform a new task (transfer learning) opposed to training from scratch.

I'd attribute a lot of the field's success to building off each others' work in this way. In other industries, new concepts like transformers or low-rank-adaptation might still be languishing under a patent instead of having been integrated and improved on by countless other groups.

> AI can evolve organically but it instead devolved into a thieve's den.

Unclear what you mean by organically - evolution still needs data.