by wskish
1111 days ago
Do they mention anywhere the definition of "low quality" data, or the proportion of removed data that was low quality versus duplicate? They mention "When upsampled, we expect SlimPajama to perform equal to or better than RedPajama-1T when training at trillion token scale." But I guess "upsampling" in this case is just explicit duplication of the training data. So the only potential gains would be from the removal of the low quality data?
> But I guess "upsampling" in this case is just explicit duplication of the training data.
Possibly, but duplication amounts to weighting, which matters for unbalanced training sets and improves results in practice.
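
A minimal sketch of why duplication and weighting come to the same thing for an averaged loss. The per-example losses and copy counts are made-up numbers for illustration, not anything from the SlimPajama write-up, and the function names are hypothetical:

```python
from fractions import Fraction

def mean_loss_duplicated(losses, copies):
    """Average loss after physically repeating example i copies[i] times."""
    expanded = [loss for loss, k in zip(losses, copies) for _ in range(k)]
    return sum(expanded) / len(expanded)

def mean_loss_weighted(losses, weights):
    """Weighted average loss over the original, un-duplicated examples."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)

# Hypothetical per-example losses; exact fractions avoid float round-off.
losses = [Fraction(1, 2), Fraction(3), Fraction(2)]
# Hypothetical upsampling factors: the second example is repeated 4 times.
copies = [1, 4, 2]

duplicated = mean_loss_duplicated(losses, copies)
weighted = mean_loss_weighted(losses, copies)
unweighted = sum(losses) / len(losses)

print(duplicated == weighted)   # the two objectives coincide
print(duplicated != unweighted) # and differ from the plain average
```

So repeating a subset k times shifts the objective toward it exactly as a weight of k would, which is the sense in which upsampling can help with an unbalanced mix even though no new data is added.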