by johnnyanmac 496 days ago

>Most other scenarios don't use millions/billions of works - that's the part which puts viability in question.

Yes, they do. We have acquisitions in the billions these days and exclusivity deals in the hundreds of millions. Let's not pretend these companies can't do this through normal channels. They just wanna steal because they think they can get away with it.

>I'd like training models to also remain accessible to open-source developers, academic researchers, and smaller businesses.

Same. But such models still need to be ethically sourced. Maybe there's not enough royalty free content to compete with OpenAI, but it's pretty clear from Deepseek that you don't need 82 TB of data to be effective. If we need that much data, there are clearly optimizations to be made.

>I think that self-interest has put them in a position of supporting fair use and copyright safe harbors,

Yet they will sue anytime their data is scraped or otherwise not making them money. Maybe they didn't put trillions into lobbying like others, but they definitely have their fair share of using copyright. Microsoft won a lawsuit against web scraping via LinkedIn less than a year before OpenAI fell into legal troubles over scraping the entire internet.

Ukv 496 days ago

> Yes, they do

To clarify: veggieroll said training models wouldn't be viable, you said it'd just require licensing like everyone else already manages, I said most other cases don't use millions/billions of works, you're saying that yes they do?

I feel like there must be a misunderstanding here, because that doesn't make much sense to me. Even for making a movie, which I think would be the most onerous of traditional cases, the number of works you'd license would likely be in the dozens (couple of pop songs, some stock images, etc.) - not billions.

> Let's not pretend these companies can't do this through normal channels

I'm not sure that there really has been a normal channel for licencing at the scale of "almost everything on the public Internet". A compulsory licensing scheme, like the US has for cover songs, could make it feasible to pay into a pot - but again I'd really hope for model training to remain accessible to smaller players, as opposed to just "meh, OpenAI has billions".

> but it's pretty clear from Deepseek that you don't need 82 TB of data to be effective.

As far as I'm aware, DeepSeek is not a low-data model. In fact, given China's more lax approach to copyright, I would not be surprised if the ability to freely pass around shadow libraries and large archives of torrented data without lawsuits was one of the factors contributing to their fast success relative to western counterparts.

> If we need that much data, there are clearly optimizations to be made.

I don't think this is necessarily a given - humans evolved on ~4 billion years worth of data, after all.

> Yet they will sue anytime their data is scraped or otherwise not making them money. Maybe they didn't put trillions into lobbying like others, but they definitely have their fair share of using copyright.

I believe lawsuits launched by, or fuss kicked up by, model developers will typically be on a contract basis (i.e. "you agreed to our ToS then broke it") rather than a copyright basis. Again, that's not to say these tech companies are acting in any way except their own self-interest, just that they've generally been more pro-fair-use than pro-strict-copyright on average, to my knowledge.

johnnyanmac 496 days ago

> I said most other cases don't use millions/billions of works, you're saying that yes they do?

I assumed we were talking about logistics, not tech. I'm sure it will be technically possible to use less training data over time (Deepseek is more or less demonstrating that in real time. Maybe there's copyrighted data, but I'd be surprised if it used anything close to 80 TB like competitors).

I know hindsight is 20/20, but I always felt the earlier approaches were absurdly brute forced.

>I'm not sure that there really has been a normal channel for licencing at the scale of "almost everything on the public Internet"

There isn't. So they'd need to do it the old-fashioned way, with agreements. Or make some incentive model that has media submit their works with the understanding that they'll be used for training. Or any number of marketing ideas.

I don't exactly pity their herculean effort. Those same companies spent decades suing individuals for much pettier uses and building those precedents up (some covered under fair use).

>and large archives of torrented data without lawsuits was one of the factors contributing to their fast success relative to western counterparts.

And now they're being slowed down, if not litigated out of the market. Public trust in AI is falling. The lack of oversight into hallucinations may have even cost a few lives. Content creators now need to take extra precautions so they aren't stolen from because they don't even bother trying to respect robots.txt. Even a few posts here on HN note how the scraping is so rampant that it can spike their hosting costs on websites (so now we need more captchas. And I hate myself for uttering such a sentence).

Was all that velocity worth it? Who benefitted from this outside of a few billionaires? We can't even say we beat China on this.

>I don't think this is necessarily a given - humans evolved on ~4 billion years worth of data, after all

Humans inherit their data and slowly structure around that. Maybe if AI models collaborated together as humanity did, I would sympathize more with this argument.

We both know it's instead a rat race and the goal isn't survival and passing on knowledge (and genes) to the next generation. AI could have evolved organically, but it instead devolved into a thieves' den.

I take the approach more like Bell's Spacecraft paradox. If they started gathering data ethically, by the time they gathered a decent chunk they probably would have already optimized a model that needs less data. It'd be slower, but not actually much slower in the long run. But they aren't exactly trying to go for quality here.

>I believe lawsuits launched by or fuss kicked up by model developers will typically be on a contract basis (i.e "you agreed to our ToS then broke it") rather than a copyright basis.

I suppose we'll see. Too early to tell. This lawsuit will definitely be precedent in other ongoing cases, but others may shift to a copyright infringement case anyway. Unlike other LLMs there was some human tailoring going on here, so it's not fully comparable to something like the NYT case.

Ukv 495 days ago

> I assumed we were talking about logistics, not tech

Still uncertain what you mean - the logistics of creating something? Logistics as in transporting goods? Either way I think veggieroll's point on viability still stands.

> Deepseek is more or less demonstrating that in real time. Maybe there's copyrighted data, but I'd be surprised if it used anything close to 80 TB like competitors

* GPT-4 is reported to have been trained on 13 trillion tokens total - which is counting two passes over a dataset of 6 trillion tokens[0]

* DeepSeek-V3, the previous model that DeepSeek-R1 was fine-tuned from, is reported to have been pre-trained on a dataset of 14.8 trillion tokens[1]

I can't find any licensing deals DeepSeek have made, so the vast majority of that will almost certainly be unlicensed data - possibly from CommonCrawl and shadow libraries.

[0]: https://patmcguinness.substack.com/p/gpt-4-details-revealed

[1]: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSee...
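For a rough sense of scale, the two reported figures can be put side by side. This is only a back-of-the-envelope sketch: the token counts are the ones quoted above, while the bytes-per-token value is an assumption introduced here for illustration (it varies by tokenizer and language) and does not come from either source.

```python
# Token counts as reported in the comment above (footnotes [0] and [1]).
GPT4_TOTAL_TOKENS = 13e12      # reported total, counting two passes over ~6T tokens
DEEPSEEK_V3_TOKENS = 14.8e12   # reported pre-training dataset size

# ASSUMPTION (not from the thread): roughly 4 bytes of raw text per token.
# The true figure depends on the tokenizer and the language mix.
ASSUMED_BYTES_PER_TOKEN = 4


def approx_terabytes(tokens, bytes_per_token=ASSUMED_BYTES_PER_TOKEN):
    """Convert a token count to an approximate size in terabytes (1 TB = 1e12 bytes)."""
    return tokens * bytes_per_token / 1e12


# Ratio of the two reported token counts.
ratio = DEEPSEEK_V3_TOKENS / GPT4_TOTAL_TOKENS

print(f"DeepSeek-V3 / GPT-4 reported tokens: {ratio:.2f}")
print(f"DeepSeek-V3 approx. size under the assumption: {approx_terabytes(DEEPSEEK_V3_TOKENS):.1f} TB")
```

Under that assumed conversion the DeepSeek-V3 figure lands in the tens of terabytes, which is the same order of magnitude as the "80 TB" being discussed rather than far below it.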

> > > Let's not pretend these companies can't do this through normal channels.

> > I'm not sure that there really has been a normal channel [...]

> There isn't.

Then, surely it's not just pretending?

A while back, as a side project, I'd had a go at making a tool to describe photos for visually impaired users. I contacted Getty to see if I could license images for model training, and was told directly that they don't license images for machine learning. Particularly given that I'm not a massive company, I just don't think there really are any viable paths at the moment except for using web-scraped datasets.

> So they'd need to do it the old-fashioned way, with agreements.

I'm sceptical of whether even the largest companies would be able to get sufficient data for pre-training models like LLMs from only explicit licensing agreements.

> I don't exactly pity their herculean effort. Those same companies spent decades suing individuals for much pettier uses and building those precedents up (some covered under fair use).

I feel you're conflating two groups: model developers that have previously been (on average) supportive of fair-use, and media companies (such as the ones currently launching lawsuits against model training) that lobbied for stronger copyright law. Both are acting in self-interest, but I'd disagree with the idea that there was any significant switching of sides on the topic of copyright.

> Content creators now need to take extra precautions so they aren't stolen from because they don't even bother trying to respect robots.txt.

The major US players claim to respect robots.txt[2][3][4], as does CommonCrawl[5] which is what the smaller players are likely to use.

You can verify that CommonCrawl respects robots.txt by downloading it yourself and checking.
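One way that check could look, as a minimal sketch: take a site's robots.txt and a list of URLs that appear in a crawl, then flag any URL the rules disallow for the crawler's user agent. `CCBot` is Common Crawl's user agent; the function name, the sample rules, and the sample URLs below are all hypothetical and only there for illustration.

```python
from urllib.robotparser import RobotFileParser


def disallowed_captures(robots_txt, captured_urls, user_agent="CCBot"):
    """Return the captured URLs that the given robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return [url for url in captured_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt and crawl captures, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: CCBot
Disallow: /private/
"""

EXAMPLE_CAPTURES = [
    "https://example.com/index.html",
    "https://example.com/private/notes.html",
]

# Any URL listed here would be a capture that the rules did not permit.
print(disallowed_captures(EXAMPLE_ROBOTS, EXAMPLE_CAPTURES))
```

A real check would need the robots.txt as it stood at crawl time, since a site's rules can change after the capture was made.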

If OpenAI/etc. are lying, it should be possible for essentially anyone hosting a website to prove it by showing access from one of the IPs they use for scraping[6]. (I say IPs rather than user-agent string because anyone can set their user-agent string to anything they want, and it's common for malicious/poorly-behaved actors to pretend to be a browser or a more common bot).
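A sketch of the IP-based check, using Python's standard `ipaddress` module. The JSON layout assumed here (a top-level `prefixes` list whose entries carry an `ipv4Prefix` or `ipv6Prefix` key) is an assumption about how such published range files are shaped, not something confirmed in this thread, and the sample ranges are reserved documentation addresses rather than real crawler IPs.

```python
import ipaddress


def extract_prefixes(published):
    """Collect CIDR strings from a published-ranges document.

    ASSUMED schema: {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}
    """
    prefixes = []
    for entry in published.get("prefixes", []):
        for key in ("ipv4Prefix", "ipv6Prefix"):
            if key in entry:
                prefixes.append(entry[key])
    return prefixes


def ip_in_prefixes(ip, prefixes):
    """Return True if ip falls inside any of the given CIDR prefixes."""
    address = ipaddress.ip_address(ip)
    # Membership is simply False when the address and network versions differ.
    return any(address in ipaddress.ip_network(prefix) for prefix in prefixes)


# Hypothetical data using reserved documentation ranges, not real crawler IPs.
SAMPLE_PUBLISHED = {
    "prefixes": [
        {"ipv4Prefix": "192.0.2.0/24"},
        {"ipv6Prefix": "2001:db8::/32"},
    ]
}

sample_prefixes = extract_prefixes(SAMPLE_PUBLISHED)
print(ip_in_prefixes("192.0.2.44", sample_prefixes))    # address inside the first range
print(ip_in_prefixes("198.51.100.7", sample_prefixes))  # address outside both ranges
```

Run against the client IPs in a server's access log, a match on a path that robots.txt disallows would be the kind of evidence described above.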

[2]: https://platform.openai.com/docs/bots

[3]: https://support.anthropic.com/en/articles/8896518-does-anthr...

[4]: https://blog.google/technology/ai/an-update-on-web-publisher...

[5]: https://commoncrawl.org/faq

[6]: https://openai.com/gptbot.json

> Was all that velocity worth it? Who benefitted from this outside of a few billionaires? We can't even say we beat China on this.

There's been a large range of beneficial uses for machine learning: language translation, video transcription, material/product defect detection, weather forecasting/early warning systems, OCR, spam filtering, protein folding, tumor segmentation, drug discovery and interaction prediction, etc.

I think this mainly comes back to my point that large-scale pretraining is not just for LLM chatbots. If you want to see the full impact, you can't just have tunnel-vision on the most currently-hyped product of the largest companies.

> Humans inherit their data and slowly structure around that. Maybe if AI models collaborated together as humanity did, I would sympathize more with this argument.

Machine learning in general (not "OpenAI") is a fairly open and collaborative field. Source code for training/testing is commonly available to use and improve; papers documenting algorithms, benchmarks, and experiments are freely available; arXiv (Cornell University's open-access preprint repository) is the place for AI papers, as opposed to paywalled journals; and it's very common to fine-tune someone's existing pretrained model to perform a new task (transfer learning) as opposed to training from scratch.

I'd attribute a lot of the field's success to building off each other's work in this way. In other industries, new concepts like transformers or low-rank adaptation might still be languishing under a patent instead of having been integrated and improved on by countless other groups.

> AI could have evolved organically, but it instead devolved into a thieves' den.

Unclear what you mean by organically - evolution still needs data.
