by Kurtz79
3551 days ago
Isn't overfitting historical data a basic mistake in any machine learning exercise, regardless of the domain? I thought the common practice was to use part of the historical data to build the model, and another sizable, non-overlapping chunk to validate it.
One problem is that too often, people break the data into only a training set and a testing set. Then they train N algos on the training data, test them on the testing data, and then trade on the algo that tested best.
Once you use the testing set for more than one algo, it's really a meta-training set.
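To make the "meta-training set" point concrete, here's a rough sketch (my own toy setup, not anyone's real backtest). It assumes the market direction is pure noise and that each "algo" is nothing but a list of coin-flip guesses, so none of them has any edge. The names `accuracy` and `selection_bias_demo` and all the sizes are made up for illustration.

```python
import random


def accuracy(predictions, outcomes):
    """Fraction of predictions that match the outcomes."""
    hits = sum(p == o for p, o in zip(predictions, outcomes))
    return hits / len(outcomes)


def selection_bias_demo(n_algos=200, n_test=100, n_fresh=100, seed=0):
    """Pick the best of n_algos no-edge algos on the test set, then re-score it.

    Returns (best test accuracy, mean test accuracy, accuracy of the
    chosen algo on fresh data that played no part in the selection).
    """
    rng = random.Random(seed)

    # Assumption: direction is pure noise (1 = up, 0 = down).
    test_outcomes = [rng.randint(0, 1) for _ in range(n_test)]
    fresh_outcomes = [rng.randint(0, 1) for _ in range(n_fresh)]

    # Assumption: every "algo" is a fixed list of coin-flip guesses.
    algos = [
        [rng.randint(0, 1) for _ in range(n_test + n_fresh)]
        for _ in range(n_algos)
    ]

    # Scoring all N algos on the same test set and keeping the winner
    # is itself a fitting step: the winner is fitted to the test set.
    test_scores = [accuracy(a[:n_test], test_outcomes) for a in algos]
    best = max(range(n_algos), key=lambda i: test_scores[i])

    fresh_score = accuracy(algos[best][n_test:], fresh_outcomes)
    mean_score = sum(test_scores) / n_algos
    return test_scores[best], mean_score, fresh_score


if __name__ == "__main__":
    best, mean, fresh = selection_bias_demo()
    print(f"best on test: {best:.2f}, mean on test: {mean:.2f}, "
          f"best algo on fresh data: {fresh:.2f}")
```

The winner's test score is the maximum of N noisy scores, so it sits above the average purely through selection. Only the fresh-data number is an honest estimate for the chosen algo, and with no real edge it should hover around 0.5.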
Really, you need a training set, a testing set, and a validation set. If you use the validation data set with more than one algo, it's no longer a validation set.
So, you train N algos and test all N. Pick the best, and validate it. If validation fails, do you have enough discipline to wait for more data to come in and try again? Most people do not, and will make hand-wavy arguments about why it's okay to re-shuffle the same data into 3 data sets and try again.
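A minimal sketch of that procedure, under my own assumptions: the data is a time-ordered list, the split is chronological (60/20/20, picked arbitrarily), and `fit`, `score`, and `threshold` are hypothetical stand-ins you'd supply yourself. The point is only that the validation slice gets touched exactly once, by exactly one model.

```python
def chronological_split(rows, train_pct=60, test_pct=20):
    """Split time-ordered rows into train / test / validation slices.

    Assumption: rows are already sorted oldest-first, and the
    percentages are illustrative. Integer arithmetic avoids float
    rounding surprises at the boundaries.
    """
    n = len(rows)
    train_end = n * train_pct // 100
    test_end = train_end + n * test_pct // 100
    return rows[:train_end], rows[train_end:test_end], rows[test_end:]


def select_then_validate(candidates, fit, score, rows, threshold):
    """Train every candidate, pick the best on test, validate it once.

    `fit(candidate, train)` returns a fitted model and
    `score(model, data)` returns a number where higher is better;
    both are hypothetical callables provided by the caller.
    """
    train, test, validation = chronological_split(rows)

    # Train N algos on the training slice only.
    fitted = [fit(c, train) for c in candidates]

    # Test N algos; this selection step consumes the test slice.
    best = max(fitted, key=lambda m: score(m, test))

    # Validate the single winner. This is the only use of the
    # validation slice; if it fails, the honest move is to wait
    # for new data rather than re-split and retry.
    validation_score = score(best, validation)
    return best, validation_score, validation_score >= threshold
```

If the returned flag is false, calling this again on a re-shuffle of the same rows is exactly the hand-wavy move described above: the validation data has already leaked into your decisions.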