

	by nraynaud 4461 days ago
	is there some kind of test to know if we are past the optimal number of dimensions? I guess overfitting could be detected by the ratio between volume and area of the classification boundary.

4 comments

Bill_Dimm 4460 days ago

Cross-validation (actually, this is mentioned toward the end of the article). Basically, fit the classifier on a subset of the data and test its predictions on the remainder. If you are overfitting, predictions on the out-of-sample data will be poor.

http://en.wikipedia.org/wiki/Cross-validation_%28statistics%...
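
To make the idea concrete, here is a minimal sketch of k-fold cross-validation in plain Python. The nearest-centroid classifier, the synthetic one-dimensional data, and every function name below are assumptions chosen for illustration; none of them come from the article.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(X, y):
    """Toy classifier (an assumption): store the mean vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict_centroid(model, row):
    """Predict the class whose centroid is closest in squared distance."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(row, model[c])))

def cross_val_accuracy(X, y, fit, predict, k=5, seed=0):
    """Fit on k-1 folds, score on the held-out fold, and average over folds."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k

# Synthetic data: two classes whose single feature differs by 4 standard deviations.
rng = random.Random(1)
X = [[rng.gauss(0, 1)] for _ in range(50)] + [[rng.gauss(4, 1)] for _ in range(50)]
y = [0] * 50 + [1] * 50

acc = cross_val_accuracy(X, y, fit_centroids, predict_centroid)
print(f"5-fold cross-validated accuracy: {acc:.3f}")
```

A model that overfits would score well on the data it was fit on but noticeably worse on the held-out folds, which is the gap this procedure is meant to expose.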


nraynaud 4460 days ago

yeah, this one is quite intuitive, but it reduces the training sample size.


Houshalter 4460 days ago

Once you find the optimal parameters, you can then train the model again on the entire dataset.


christopheraden 4460 days ago

There's some additional information you'd need to determine this. Suppose, for the sake of argument, I only have one feature, X. Pretend I extend it into two dimensions by simply replicating the feature. In two dimensions, (X,X) forms a perfect straight line, which should make it clear that using an additional dimension didn't gain you anything.
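
The replicated-feature case can be checked numerically. The sketch below uses made-up Gaussian data and hypothetical helper names: it shows that a copied feature has correlation 1 with the original and that the centered 2x2 Gram matrix of (X, X) has zero determinant, i.e. the two columns span only one direction.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def centered_gram_det(a, b):
    """Determinant of the 2x2 Gram matrix of the mean-centered columns a and b."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return saa * sbb - sab * sab

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(100)]      # the single real feature
x_copy = list(x)                               # the "second dimension" is a replica
noise = [rng.gauss(0, 1) for _ in range(100)]  # an unrelated feature, for contrast

print("corr(x, copy):  ", pearson(x, x_copy))
print("det for (x, x): ", centered_gram_det(x, x_copy))
print("corr(x, noise): ", pearson(x, noise))
print("det for (x, n): ", centered_gram_det(x, noise))
```

A determinant of zero means the second column adds no new direction. Note that the unrelated noise column does add a direction, yet it still need not help classification, which is why the response has to enter the picture as well.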

Testing whether or not to include additional terms requires an understanding of the distribution of the response, as well as the amount of collinearity among the features (how similar the features are). There are ways to do this in statistics, but it is more something done in inference than in prediction.

Heuristically, the most common way is just to look at the cross-validated classification error and compare it with and without a feature (or set of features) in question. Asking about the distribution of the cross-validated classification rate is an interesting statistical question, though!
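
A rough sketch of that with/without comparison, using leave-one-out cross-validation so that it stays short. The nearest-centroid classifier, the synthetic data (feature 0 informative, feature 1 pure noise), and all names are assumptions made for the example.

```python
import random

def nearest_centroid_predict(train_X, train_y, row):
    """Predict the label whose class mean is closest to `row`."""
    groups = {}
    for x, label in zip(train_X, train_y):
        groups.setdefault(label, []).append(x)
    best_label, best_dist = None, float("inf")
    for label, rows in groups.items():
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(row, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_error(X, y, columns):
    """Leave-one-out error rate using only the feature indices in `columns`."""
    Xs = [[row[j] for j in columns] for row in X]
    mistakes = 0
    for i in range(len(Xs)):
        train_X = Xs[:i] + Xs[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, Xs[i]) != y[i]:
            mistakes += 1
    return mistakes / len(Xs)

# Synthetic data: feature 0 carries the class signal, feature 1 is pure noise.
rng = random.Random(0)
y = [i % 2 for i in range(80)]
X = [[rng.gauss(3.0 * label, 1.0), rng.gauss(0.0, 1.0)] for label in y]

err_without = loo_error(X, y, columns=[0])
err_with = loo_error(X, y, columns=[0, 1])
print(f"error without feature 1: {err_without:.3f}")
print(f"error with feature 1:    {err_with:.3f}")
```

The two error rates are themselves noisy estimates, so a small difference between them is not strong evidence either way; that is the distributional question raised above.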


michaelochurch 4460 days ago

Cross-validation, separate training and test data sets, or AIC/BIC (AIC is more forgiving than BIC) if you can get a reasonable estimate of your "degrees of freedom". (For many models, however, d.f. is either not defined or intractable. For bagged or boosted trees, for example, you need CV or a test set.)
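
For reference, a small sketch of the two criteria using their standard definitions, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of fitted parameters and n the number of observations. The log-likelihood, k, and n in the example are made-up numbers, not results from any real model.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: each parameter costs 2."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: each parameter costs ln(n)."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical fit: maximized log-likelihood of -120 with 5 parameters on 1000 rows.
log_l, k, n = -120.0, 5, 1000

print("AIC:", aic(log_l, k))
print("BIC:", bic(log_l, k, n))
print("per-parameter penalty, AIC vs BIC:", 2, "vs", math.log(n))
```

Lower is better for both. BIC's per-parameter penalty ln(n) exceeds AIC's penalty of 2 once n is larger than e^2 (about 7.4), which is the sense in which AIC is more forgiving of extra parameters.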

If you're data rich, you tend not to use CV but to have two or three sets. Three is better because you ideally have (a) a training set for building models with known, fixed "hyperparameters" (e.g. regularization coefficients, tree sizes, neural net topologies), (b) a validation set for evaluating models with varying hyperparameters, in order to select them optimally, and (c) a test set on which you can evaluate the model's accuracy after your hyperparameters have been chosen in (b). Cross-validation is typically what you need to do when you have a small number of observations (say, 1000).
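
The three-set workflow can be sketched as follows. The 60/20/20 split, the one-dimensional k-nearest-neighbour classifier, the candidate values of k, and the synthetic data are all assumptions picked to keep the example small.

```python
import random

def three_way_split(n, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle indices 0..n-1 and split them into train, validation and test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

def knn_predict(train, k, x):
    """Majority vote among the k training points nearest to x (labels are 0/1)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, k, rows):
    """Fraction of `rows` classified correctly by k-NN fit on `train`."""
    return sum(knn_predict(train, k, x) == label for x, label in rows) / len(rows)

# Synthetic data: (feature, label) pairs, classes separated by 3 standard deviations.
rng = random.Random(0)
data = [(rng.gauss(3.0 * (i % 2), 1.0), i % 2) for i in range(300)]

train_idx, val_idx, test_idx = three_way_split(len(data))
train = [data[i] for i in train_idx]
val = [data[i] for i in val_idx]
test = [data[i] for i in test_idx]

# (a) fit on train for each fixed k, (b) pick k on the validation set,
# (c) report accuracy once on the untouched test set.
best_k = max([1, 5, 15], key=lambda k: accuracy(train, k, val))
test_acc = accuracy(train, best_k, test)
print("chosen k:", best_k)
print(f"test accuracy: {test_acc:.3f}")
```

The test set is consulted only once, after k is fixed, so its accuracy is not biased by the hyperparameter search.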


hessenwolf 4460 days ago

What do you do for a living?


RK 4460 days ago

You could make a plot like Figure 1. Look for the turning point (do some calculus if you can, i.e. d(perf)/d(dim) = 0).


christopheraden 4460 days ago

Derivatives require continuity, and the number of dimensions is discrete. It would be sufficient to simply look at which number of dimensions gave you the best cross-validated classification rate.


mturmon 4460 days ago

Alas, it rarely looks clean like that. Actually never, in my experience, for real problems.
