by yokaze
1109 days ago
> After removing punctuation, space symbols, newlines and tabs, we filtered out documents with less than 200 characters. These documents typically contain only meta data and no useful information.

> But i guess "upsampling" in this case is just explicit duplication of the training data.

Possibly, but duplication amounts to weighting, and weighting matters for unbalanced training sets and improves results in practice.
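To make both points concrete, here is a rough Python sketch. It is not the paper's code: the function names, the Unicode-category test for punctuation, and the toy loss values are all my own assumptions. The first part mimics the quoted length filter; the second shows that duplicating an example k times gives the same mean loss as weighting it by k.

```python
import unicodedata


def content_length(text: str) -> int:
    """Count characters left after dropping whitespace and punctuation.

    Assumption: "punctuation" means any Unicode category starting with "P".
    """
    return sum(
        1
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def keep_document(text: str, min_chars: int = 200) -> bool:
    """Keep a document only if it has at least min_chars content characters."""
    return content_length(text) >= min_chars


def mean_loss_duplicated(losses, copies):
    """Mean loss over a dataset where example i is physically repeated copies[i] times."""
    expanded = [loss for loss, k in zip(losses, copies) for _ in range(k)]
    return sum(expanded) / len(expanded)


def mean_loss_weighted(losses, weights):
    """Weighted mean loss over the original, unduplicated dataset."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)


if __name__ == "__main__":
    print(keep_document("a" * 200))   # long enough, kept
    print(keep_document("a, " * 199))  # only 199 content characters, dropped

    losses = [0.5, 2.0]  # hypothetical per-example losses
    copies = [1, 3]      # the rare example is upsampled 3x
    print(mean_loss_duplicated(losses, copies))
    print(mean_loss_weighted(losses, copies))
```

The two means agree in expectation, so the duplicated dataset and the weighted one optimize the same objective; with minibatch SGD the gradient noise can still differ between the two.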