by godelski
870 days ago
So I am an ML researcher. Note that part of my comment is specifying how difficult it actually is to ensure lack of spoilage. The second paper I link is actually a pretty good proof of this, though I wish they had been more explicit about how a random pruning significantly improves results. That is quite the result in and of itself, given that the datasets they look at are already filtered. Dedupe is fucking hard. So I'm not looking for a handwavy "trust me"; I'm looking for the explicit vetting processes applied to these specific datasets. It's incredibly important to know the limits of your tools, and that includes datasets (as well as metrics).