|
|
|
|
|
by orting
4122 days ago
|
|
I think the points are good, but I am not very happy about this statement "When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low. But as we hit “big data”, it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly." If done correctly, then I agree. But we have to be carefull about overfitting when we try out several models or make an initial analysis to determine which model to use. In this sense, choosing a model is no different from fitting the parameters of the model. |
|
If you aren't doing this correctly, then you can't really interpret the performance of even a single model. Seen people screw this up in so many ways - my favorite recent one that was quite high on HN was someone using the full dataset for variable selection, before doing a training-testing split afterwards.