HN Mirror


by orting 4169 days ago

I think the points are good, but I am not very happy about this statement

"When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low. But as we hit “big data”, it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly."

If done correctly, then I agree. But we have to be careful about overfitting when we try out several models or run an initial analysis to decide which model to use. In this sense, choosing a model is no different from fitting the parameters of the model.

3 comments

idunning 4169 days ago

If you are disciplined, and separate data into training and testing sets, you can try as many models as you want without fear of overfitting. Indeed, optimizing over the parameters of a model on the training set is essential (pruning parameters in a tree, regularization weights, etc.) and can be thought of as training a large number of models.

If you aren't doing this correctly, then you can't really interpret the performance of even a single model. I have seen people screw this up in so many ways. My favorite recent one, which was quite high on HN, was someone using the full dataset for variable selection and only doing the training-testing split afterwards.
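To make the variable-selection mistake concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration (pure-noise features, coin-flip labels, a simple sign-vote classifier, and the names `run_experiment` and `holdout_accuracy`); it is not the code from the post mentioned above. Because the data contain no signal, an honest test accuracy should sit near 50%. Selecting features on the full dataset before splitting should push the reported test accuracy well above that.

```python
import random

def make_noise_data(n_samples, n_features, rng):
    """Pure-noise features and coin-flip labels: there is nothing to learn."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [rng.choice((-1, 1)) for _ in range(n_samples)]
    return X, y

def feature_scores(X, y, rows):
    """Covariance-like score of every feature with the label, using only `rows`."""
    return [sum(X[i][j] * y[i] for i in rows) for j in range(len(X[0]))]

def top_k(scores, k):
    """Indices of the k features with the largest absolute score."""
    return sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)[:k]

def holdout_accuracy(X, y, train_rows, test_rows, selected):
    """Fit a sign-vote classifier on the training rows, score it on the test rows."""
    train_scores = feature_scores(X, y, train_rows)
    weights = {j: (1 if train_scores[j] >= 0 else -1) for j in selected}
    correct = 0
    for i in test_rows:
        vote = sum(w * X[i][j] for j, w in weights.items())
        correct += (1 if vote >= 0 else -1) == y[i]
    return correct / len(test_rows)

def run_experiment(n_repeats=10, n_samples=60, n_features=500, k=20, seed=0):
    """Return (mean leaky accuracy, mean honest accuracy) over several noise datasets."""
    rng = random.Random(seed)
    n_train = 2 * n_samples // 3
    leaky, honest = [], []
    for _ in range(n_repeats):
        X, y = make_noise_data(n_samples, n_features, rng)
        rows = list(range(n_samples))
        rng.shuffle(rows)
        train_rows, test_rows = rows[:n_train], rows[n_train:]
        # Wrong: variable selection sees the test rows before the split is used.
        leaky_sel = top_k(feature_scores(X, y, rows), k)
        # Right: variable selection only ever sees the training rows.
        honest_sel = top_k(feature_scores(X, y, train_rows), k)
        leaky.append(holdout_accuracy(X, y, train_rows, test_rows, leaky_sel))
        honest.append(holdout_accuracy(X, y, train_rows, test_rows, honest_sel))
    return sum(leaky) / n_repeats, sum(honest) / n_repeats

if __name__ == "__main__":
    leaky_mean, honest_mean = run_experiment()
    print(f"selection on full data: {leaky_mean:.2f}")
    print(f"selection on train only: {honest_mean:.2f}")
```

The only difference between the two pipelines is which rows the selection step is allowed to see, which is why this kind of leak is easy to miss in a code review.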


stiff 4169 days ago

If you use performance on the test set for model selection, this is not true. It follows from simple probabilistic reasoning: the more models you try, the higher the chance that one will score well on both the training set and the test set by "luck", and this is especially true with small datasets. In fact, it is best practice to use a separate validation set for model selection and to use the test set only for final performance evaluation; see e.g. the answer to this question:

http://stats.stackexchange.com/questions/9357/why-only-three...
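The "luck" argument can be checked with a small simulation. The sketch below is only an illustration under made-up assumptions: every "model" is a coin flipper with no skill at all, and the function names and set sizes are hypothetical. The best of many such models looks good on the set used to pick it, while an untouched set shows it for what it is.

```python
import random

def coin_flip_model(rng, n_examples):
    """A 'model' with no skill: its predictions are independent coin flips."""
    return [rng.choice((0, 1)) for _ in range(n_examples)]

def accuracy(preds, labels):
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)

def select_best_model(n_models=200, n_select=50, n_fresh=50, seed=0):
    """Pick the best of many skill-free models on one set, then score it on another.

    Returns (accuracy on the set used for selection, accuracy on the untouched set).
    """
    rng = random.Random(seed)
    labels = [rng.choice((0, 1)) for _ in range(n_select + n_fresh)]
    select_labels, fresh_labels = labels[:n_select], labels[n_select:]
    best = None
    for _ in range(n_models):
        preds = coin_flip_model(rng, n_select + n_fresh)
        select_acc = accuracy(preds[:n_select], select_labels)
        if best is None or select_acc > best[0]:
            # Remember the winner's score on data that played no part in choosing it.
            best = (select_acc, accuracy(preds[n_select:], fresh_labels))
    return best

def average_over_runs(n_runs=20):
    """Average both scores over several independent runs to smooth out noise."""
    results = [select_best_model(seed=s) for s in range(n_runs)]
    mean_select = sum(r[0] for r in results) / n_runs
    mean_fresh = sum(r[1] for r in results) / n_runs
    return mean_select, mean_fresh

if __name__ == "__main__":
    mean_select, mean_fresh = average_over_runs()
    print(f"score on the set used for selection: {mean_select:.2f}")
    print(f"score on the untouched set:          {mean_fresh:.2f}")
```

If the set used for selection is the test set, the first number is what gets reported. With a separate validation set for selection, the reported number is the second one, which stays near chance here, as it should.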


chengtao 4169 days ago

I personally love the topic of Bayesian optimization over all the possible parameters, including model choice. My point was more that, given that resources are always constrained, it typically pays off in the long term for practitioners to analyze the data and understand the underlying mechanics before jumping into modeling.


texthompson 4169 days ago

I thought exactly the same thing. Statistics is about uncertainty, and it's very easy to be misled when you don't correct for trying lots of hypotheses.
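As a rough illustration of why the correction matters, the sketch below computes the chance of at least one false positive across many tests, and the Bonferroni threshold, which is one common correction among several. It assumes the tests are independent, and the function names are made up for this example.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests

if __name__ == "__main__":
    alpha, n_tests = 0.05, 20
    print(f"uncorrected: {family_wise_error_rate(alpha, n_tests):.3f}")
    per_test = bonferroni_threshold(alpha, n_tests)
    print(f"per-test threshold: {per_test:.4f}")
    print(f"corrected:   {family_wise_error_rate(per_test, n_tests):.3f}")
```

With 20 hypotheses each tested at 0.05, the uncorrected chance of at least one spurious "discovery" is roughly 64%.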
