| HN Mirror



	by tpdly 52 days ago
	I think you undervalue the contribution of internet-scale data to foundation modeling, and because LLMs can obsolete the content they required, I think it's fair to characterize it as theft. Obviously RL contributes a lot to capabilities, but the judgement that an LLM uses to 'synthesize information' is born from the training data. The scale of the data really is beyond intuition. Books3, for example, would take 230 years of continuous reading. I actually think the "proprietary non-deterministic database of the free internet" does a lot to characterize the capabilities and effects for a lot of people. Obviously coders are more in tune with how well agents can work, but that's also due more to the RL breakthroughs than to foundation modeling.

2 comments

d0mine 52 days ago

As I understand it, RL makes foundation models stupider (less capable, not more) but better at following instructions.


adamtaylor_13 52 days ago

Can you steal something that is free and openly available?

I just don't understand this argument. "Theft" feels like a nice, heavy, moral accusation to toss at those you're debating with, but the actual prerequisites for theft don't even exist in this situation.


mclightning 51 days ago

It is a lot more complicated than that. Your content is not simply used, copied, or even just distributed. The very terrain on which you produce, distribute, and represent your content has shifted due to the mechanics of it. Anything you produce is grabbed into AI summaries and into the training data. Humans produce free/open materials for many reasons. A lot of them, e.g. communities, don't have room to breathe and gain structure due to AI siphoning off the entire atmosphere of the web.


tpdly 51 days ago

I mean, not that I'm a huge fan of IP laws, but yes?

Like I said, if you provide an alternative to all these blogs and forums (because you trained on them or because you scrape them for RAG) then you are stealing their traffic. Search engines were/are already doing that, but the foundation training
