Hacker News
by nraynaud 4413 days ago
Is there some kind of test to tell whether we are past the optimal number of dimensions? I guess overfitting could be detected by the ratio between the volume and the area of the classification boundary.
4 comments

Cross-validation (actually, this is mentioned toward the end of the article). Basically, fit the classifier on a subset of the data and test its predictions on the remainder. Predictions for out-of-sample data will be poor if you have overfitting.

http://en.wikipedia.org/wiki/Cross-validation_%28statistics%...
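
A minimal sketch of the fit-on-a-subset, test-on-the-remainder idea, written as k-fold cross-validation in plain Python. The nearest-centroid classifier and all function names here are assumptions chosen to keep the example self-contained; they are not from the article. It also assumes there are at least k observations.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(X, y):
    """Toy classifier (an assumption): store the mean feature vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict(centroids, row):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
    )

def cross_val_accuracy(X, y, k=5, seed=0):
    """Average accuracy on the held-out fold, over k folds."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k
```

A held-out accuracy far below the accuracy on the training rows is the symptom of overfitting described above.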

Yeah, this one is quite intuitive, but it reduces the training sample size.
Once you find the optimal parameters, you can then train the model again on the entire dataset.
There's some additional information you'd need to determine this. Suppose, for the sake of argument, I only have one feature, X. Pretend I extend it into two dimensions by simply replicating the feature. In two dimensions, (X,X) forms a perfect straight line, which should make it clear that using an additional dimension didn't gain you anything.
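
To make the replicated-feature point concrete, here is a small sketch that measures the correlation between a feature and its copy. The sample values and the `pearson` helper are made up for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

# Illustrative values only.
x = [0.5, 1.0, 2.5, 4.0, 7.0]
duplicate = list(x)                  # the "second dimension" is just X again
other = [3.0, -1.0, 2.0, 0.5, 2.5]   # a feature that is not a copy of X

r_dup = pearson(x, duplicate)        # 1.0 up to floating-point rounding
r_other = pearson(x, other)          # strictly between -1 and 1
```

A correlation of 1 means the points (X, X) lie on a straight line, so the second column carries no information the first did not already have.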

Testing whether or not to include additional terms requires an understanding of the distribution of the response, as well as the amount of collinearity among the features (how similar the features are to one another). There are ways to do this in statistics, but that is more a concern of inference than of prediction.

Heuristically, the most common way is just to look at the cross-validated classification error and compare it with and without a feature (or set of features) in question. Asking about the distribution of the cross-validated classification rate is an interesting statistical question, though!
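
A rough sketch of that with-and-without comparison, using leave-one-out cross-validation and a 1-nearest-neighbour classifier. Both choices, the `loo_error` name, and the hand-built data are assumptions for illustration. Feature 0 separates the classes; feature 1 is large-scale noise.

```python
def loo_error(X, y, cols):
    """Leave-one-out error rate of 1-nearest-neighbour using only columns `cols`."""
    errors = 0
    for i in range(len(X)):
        # Nearest other point, measured on the selected columns only.
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((X[i][c] - X[j][c]) ** 2 for c in cols),
        )
        errors += y[nearest] != y[i]
    return errors / len(X)

# Hand-built toy data: column 0 is informative, column 1 is not.
X = [
    [0.0, 0.0], [0.1, 10.0], [0.2, 20.0], [0.3, 30.0],   # class 0
    [1.0, 0.5], [1.1, 10.5], [1.2, 20.5], [1.3, 30.5],   # class 1
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

err_without = loo_error(X, y, cols=[0])      # feature 1 left out
err_with = loo_error(X, y, cols=[0, 1])      # feature 1 included
```

In this contrived case the extra feature dominates the distance and the cross-validated error goes up when it is included, which is the signal to drop it.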

Cross-validation, separate training and test data sets, or AIC/BIC (AIC is more forgiving than BIC) if you can get a reasonable estimate of your "degrees of freedom". (For many models, however, d.f. is either not defined or intractable. For bagged or boosted trees, for example, you need CV or a test set.)
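
For reference, the standard definitions, with k the number of estimated parameters (the "degrees of freedom"), n the number of observations, and \hat{L} the maximized likelihood:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

Lower is better for both. Each extra parameter costs 2 under AIC and \ln n under BIC, so BIC is the harsher of the two whenever \ln n > 2, that is, for n of 8 or more.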

If you're data rich, you tend not to use CV but to have two or three sets. Three is better because you ideally have (a) a training set for building models with known, fixed "hyperparameters" (e.g. regularization coefficients, tree sizes, neural net topologies), (b) a validation set for evaluating models with varying hyperparameters, in order to select them optimally, and (c) a test set on which you can evaluate the model's accuracy after your hyperparameters have been chosen using (b). Cross-validation is typically what you need when you have a small number of observations (say, 1000).
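
A sketch of the three-set workflow. The 60/20/20 proportions, the function names, and the `fit`/`score` callables are all assumptions; substitute whatever model and metric you actually use.

```python
import random

def three_way_split(n, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle row indices and cut them into train / validation / test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * frac_train)
    n_val = round(n * frac_val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def select_then_test(candidates, fit, score, train, val, test):
    """Pick the hyperparameter that scores best on `val`, then score it once on `test`.

    fit(h, rows) -> model trained with hyperparameter h on `rows`
    score(model, rows) -> number, higher is better
    """
    models = {h: fit(h, train) for h in candidates}               # (a) training set
    best = max(candidates, key=lambda h: score(models[h], val))   # (b) validation set
    return best, score(models[best], test)                        # (c) test set
```

The test rows are touched exactly once, after selection, so the final score is not biased by the hyperparameter search.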

What do you do for a living?
You could make a plot like Figure 1. Look for the turning point (do some calculus if you can, i.e. d(perf)/d(dim) = 0).
Derivatives require a continuous variable, and the number of dimensions is discrete. It would be sufficient to simply look at which number of dimensions gave you the best cross-validated classification rate.
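
In code, that is just an argmax over the candidate dimension counts. A tiny sketch follows; the rates are placeholder numbers for illustration, not results from the article, and breaking ties toward fewer dimensions is my own assumption.

```python
def best_dimension(cv_rate_by_dim):
    """Return the dimension count with the highest cross-validated rate.

    Ties go to the smaller number of dimensions.
    """
    return min(cv_rate_by_dim, key=lambda d: (-cv_rate_by_dim[d], d))

# Placeholder values only: {number of dimensions: cross-validated rate}.
rates = {1: 0.62, 2: 0.71, 3: 0.78, 4: 0.78, 5: 0.74, 6: 0.69}
chosen = best_dimension(rates)
```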
Alas, it rarely looks clean like that. Actually never, in my experience, for real problems.