| More data is better. You can reduce it via PCA, one of the many techniques in multivariate statistics. You can do ANOVA to select your predictors. In general, you can use a subset of the data with the tools that statistics provides. Complaining about messy data... welcome to the real world. As for complaining about non-reproducible models, choose reproducible ones. I've mostly done statistical models and forest-based algorithms, and they're all reproducible. All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data? > "The results were consistent with asymptotic theory (central limit theorem) that predicts that more data has diminishing returns" CLT talks about sampling from the population infinitely. It doesn't say anything about diminishing returns. I don't get how you go from sampling infinitely to diminishing returns. |
The solution is to direct research effort towards learning algorithms that generalise well from few examples.
Don't expect the industry to lead this effort, though. The industry sees the reliance on large datasets as something to be exploited for a competitive advantage.
>> You can reduce it via PCA, one of the many techniques in multivariate statistics.
PCA is a dimensionality reduction technique. It reduces the number of features the learner has to work with. It doesn't do anything about the number of examples needed to guarantee good performance. The article is addressing the need for more examples, not more features.
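To make the features-versus-examples distinction concrete, here is a minimal sketch, assuming NumPy and a plain SVD-based PCA. The function name `pca_reduce`, the random data, and the sizes (200 examples, 10 features, 3 components) are made up for illustration and are not from the article.

```python
import numpy as np

def pca_reduce(X, k):
    """Project X (n_examples x n_features) onto its top-k principal components."""
    # Centre each feature (column) so the SVD captures variance around the mean.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Keep the first k directions and project every example onto them.
    return Xc @ Vt[:k].T

# Hypothetical data: 200 examples, 10 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

Z = pca_reduce(X, 3)
print(X.shape, Z.shape)  # (200, 10) (200, 3)
```

The number of rows (examples) is the same before and after; only the number of columns (features) shrinks. So PCA changes how each example is represented, not how many examples you have to collect.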