by rfoo
1338 days ago
> Wikipedia demonstrated that crowdsourced data can be pretty competitive. The problem is DL is really sensitive to dirty data, disproportionately so. At $DAYJOB once we cleaned the dataset, removed a few mislabeled identity/face pairs (very few, about 1 in 1e4) and the metrics goes up a lot.

In fact, DL is generally quite tolerant of label noise, especially with modern training methods such as SSL pretraining.
https://arxiv.org/pdf/1705.10694.pdf https://proceedings.neurips.cc/paper/2018/file/a19744e268754... https://proceedings.mlr.press/v97/hendrycks19a.html