| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by belter 854 days ago
	For all the Model Cards and License notices, I find it interesting there is not much information on the contents of the dataset used for training. Specifically, if it contains data subject to Copyright restrictions. Or did I miss that?

1 comments

Yeah, its an unspoken but rampant thing in the llm community. Basically no one respects licenses for training data.

I'd say the majority of instruct tunes, for instance, use OpenAI output (which is against their TOS).

But its all just research! So who cares! Or at least, that seems to be the mood.