| HN Mirror



	by wskish 1111 days ago
	Do they mention anywhere the definition of "low quality" data, or the proportion of removed data that was low quality versus duplicate? They say "When upsampled, we expect SlimPajama to perform equal to or better than RedPajama-1T when training at trillion token scale." But I guess "upsampling" in this case is just explicit duplication of the training data. So the only potential gains would come from the removal of the low-quality data?

1 comment

yokaze 1111 days ago

> After removing punctuation, space symbols, newlines and tabs, we filtered out documents with less than 200 characters. These documents typically contain only meta data and no useful information.
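For illustration, the quoted length filter could be sketched roughly like this. This is only a guess at the mechanics, not the actual SlimPajama code: the exact character set they strip is not given in the quote, so ASCII punctuation plus standard whitespace is assumed here, and the names `is_too_short` and `filter_short` are made up for the example.

```python
import string

# Assumption: "punctuation, space symbols, newlines and tabs" is approximated
# by ASCII punctuation plus Python's standard whitespace characters.
_STRIP = str.maketrans("", "", string.punctuation + string.whitespace)

MIN_CHARS = 200  # threshold from the quoted passage


def is_too_short(doc: str, min_chars: int = MIN_CHARS) -> bool:
    """True if fewer than min_chars characters remain after stripping."""
    return len(doc.translate(_STRIP)) < min_chars


def filter_short(docs):
    """Keep only documents that pass the length check."""
    return [d for d in docs if not is_too_short(d)]
```

Note that the length is measured after stripping, so a page made mostly of separators and whitespace is dropped even if its raw length is well over 200.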

> But I guess "upsampling" in this case is just explicit duplication of the training data.

Possibly, but duplication amounts to weighting, which matters for unbalanced training sets and improves results in practice.
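A toy sketch of what "duplication amounts to weighting" means, assuming upsampling really is plain repetition of documents from a source. The source names (`web`, `code`), the counts, and the factor of 4 are all invented for the example and have nothing to do with the real SlimPajama mix.

```python
from collections import Counter
from fractions import Fraction


def upsample(corpus, factors):
    """Repeat each (source, doc) pair factors[source] times (default 1)."""
    out = []
    for source, doc in corpus:
        out.extend([(source, doc)] * factors.get(source, 1))
    return out


def source_mix(corpus):
    """Fraction of the corpus contributed by each source."""
    counts = Counter(source for source, _ in corpus)
    total = sum(counts.values())
    return {s: Fraction(c, total) for s, c in counts.items()}


# Hypothetical unbalanced corpus: 8 web documents, 2 code documents.
corpus = [("web", f"w{i}") for i in range(8)] + [("code", f"c{i}") for i in range(2)]

mix_before = source_mix(corpus)
mix_after = source_mix(upsample(corpus, {"code": 4}))
```

Here `code` goes from 1/5 of the corpus to 1/2 after being repeated 4 times, which has the same effect on what the model sees as sampling `code` documents with 4x the weight. No new information is added, but the balance between sources changes.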
