|
|
|
|
|
by danger
5076 days ago
|
|
Another way to say what you're proposing is to use a linear predictor, and to train it by globally optimizing 0-1 loss (just the number of mistakes that you make) on the full set of data that you have. Even ignoring computational issues (this can be shown to be NP-hard), you seem to be making several mistaken assumptions. I'd really recommend reading some basic stuff on generalization, but a couple of the mistaken assumptions are as follows: 1. That the particular input features that you've chosen are somehow the only possible choice. But who's to say that you shouldn't add new features which are the square of each original feature? Or maybe some cross product terms, like the product of the ith feature times the jth feature. Or maybe some good features to add would be the distance from each point you've seen so far. Etc. Continuing down this path, you basically get to the question discussed in the OP about choosing a kernel for SVMs. This is just one example where hyperparameters come into play, and you need some method for choosing them. 2. That a linear predictor is impervious to overfitting. Consider the extreme case (which comes up often) where you have millions or billions of features and far fewer examples (e.g., if features are n-gram occurrences in text, or gene expression data). Then it's likely that there are many settings of weights that fit the data perfectly, but there's no way to tell if you're just picking up on statistical noise, or if you've learned something that will make good predictions on new data that you encounter. In both theory and practice, you need some form of regularization, and along with this comes more hyperparameters, which need to be chosen. Finally, by your reasoning, it seems like you would always choose a 1-nearest neighbors classifier [1] (because it will always end up with 0 error under the setting your propose). But there's no reason why this is in general a good idea. [1] http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm |
|