by mathgenius 3665 days ago
It's just so ridiculously easy to overfit these models, and there are so many ways to shoot yourself in the foot as a result.

For example, "I split the data set into 5 random segments and then trained a model on 4 of the 5 segments and then tested it on 5th." Such data is serially correlated (it's not good old iid) so already it looks like you have poisoned the test set with information from the training set.
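To make the leakage concrete, here is a minimal sketch on synthetic data, not the article's actual setup. It assumes a random walk as the serially correlated series and a deliberately naive "model" that predicts each test point from the training point nearest in time; the helper name `nearest_in_time_mse` is made up for illustration. The random 1-in-5 holdout looks far better than a chronological holdout simply because every test point has training neighbours right next to it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, serially correlated series: a random walk (an assumption for illustration).
n = 1000
t = np.arange(n)
y = np.cumsum(rng.standard_normal(n))

def nearest_in_time_mse(t_train, y_train, t_test, y_test):
    """MSE of predicting each test value with the training value closest in time."""
    order = np.argsort(t_train)
    t_sorted, y_sorted = t_train[order], y_train[order]
    # Position of the first training time greater than each test time,
    # clipped so that both a left and a right neighbour always exist.
    idx = np.clip(np.searchsorted(t_sorted, t_test), 1, len(t_sorted) - 1)
    left, right = t_sorted[idx - 1], t_sorted[idx]
    nearest = np.where(t_test - left <= right - t_test, idx - 1, idx)
    return float(np.mean((y_sorted[nearest] - y_test) ** 2))

# Random split: hold out one fifth of the points chosen at random.
perm = rng.permutation(n)
test_idx, train_idx = perm[: n // 5], perm[n // 5 :]
random_mse = nearest_in_time_mse(t[train_idx], y[train_idx], t[test_idx], y[test_idx])

# Chronological split: hold out the final fifth of the series.
cut = n - n // 5
chrono_mse = nearest_in_time_mse(t[:cut], y[:cut], t[cut:], y[cut:])

print(f"random-split MSE: {random_mse:.2f}")
print(f"chronological-split MSE: {chrono_mse:.2f}")
```

The gap between the two numbers is the information that leaked from the training set into the randomly chosen test set.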

The hard part is not "feature engineering" or "ensemble methods"; the hard part is controlling the entropy that you feed these things, because they are voracious monsters and will absolutely eat all of it.

4 comments

> Such data is serially correlated (it's not good old iid) so already it looks like you have poisoned the test set with information from the training set.

Kind of. If it were that simple, making money off of an autoregressive model would be trivial -> everyone would do it -> serial correlation would disappear.

I agree with your observation that figuring out what to feed the beast is one of the bigger challenges though. Case in point: train a mean reversion model on the last seven years of S&P data to buy dips and train a momentum model to buy higher highs. That equity curve would look very encouraging. Do it on a fifteen-year basis, and not so much. Now the question becomes: how long a lookback do you use when training your models? Chopping up data at random will wash out useful correlations. Subsetting into periods leads to poorly generalized models. Not fun.
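One common way to make the lookback an explicit knob is a rolling walk-forward split: each model sees only the `lookback` observations immediately before the block it is tested on. The sketch below is a generic illustration of that idea, not anything from the comment above; `walk_forward_splits`, `lookback`, and `horizon` are hypothetical names.

```python
def walk_forward_splits(n_obs, lookback, horizon):
    """Yield (train, test) index ranges that respect time order.

    Each training window is the `lookback` observations immediately
    preceding a test block of `horizon` observations.
    """
    start = lookback
    while start + horizon <= n_obs:
        train = range(start - lookback, start)
        test = range(start, start + horizon)
        yield train, test
        start += horizon  # slide forward by one test block


# Usage: 10 observations, 4-step lookback, 2-step test blocks.
splits = list(walk_forward_splits(n_obs=10, lookback=4, horizon=2))
for train, test in splits:
    print(list(train), "->", list(test))
```

Running the same evaluation with several `lookback` values shows how sensitive a strategy is to that choice, though it does not say which value is right.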

It's so easy to do machine learning and think you're a genius when you are in fact overfitting. It's almost like casino gambling. You tweak some hyperparameter, pull the slot machine lever, and wham, your model says you should be rich real soon...

This is one of the better responses. Issues that arise: a low number of data points at macro timescales, the time-series nature of the data (and the local correlation between individual data points), which makes it hard to extract training and test sets, and the overarching structural shifts in the market over time that invalidate older data (depending on context).

Doesn't that cut both ways, though? If there are serial correlations in the data then modeling and accounting for the variance explained by those correlations should help with future predictions, no?
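On synthetic data, at least, the answer is yes. The sketch below assumes an AR(1) process with a fixed coefficient of 0.8 (chosen only for illustration), estimates the coefficient from the earlier part of the series, and compares one-step-ahead forecasts on the later part against simply predicting the training mean. The correlation here is stable by construction, which is exactly what the earlier comments doubt about real markets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) series: x[t] = phi * x[t-1] + noise (phi_true is an assumption).
phi_true = 0.8
n = 2000
noise = rng.standard_normal(n)
x = np.zeros(n)
for i in range(1, n):
    x[i] = phi_true * x[i - 1] + noise[i]

# Chronological split: fit on the past, evaluate on the future.
split = 1500
train, test = x[:split], x[split:]

# Least-squares estimate of the AR(1) coefficient (no intercept), training data only.
phi_hat = float(np.dot(train[:-1], train[1:]) / np.dot(train[:-1], train[:-1]))

# One-step-ahead forecasts: each test value is predicted from the value just before it.
previous = x[split - 1 : -1]
ar_mse = float(np.mean((test - phi_hat * previous) ** 2))

# Baseline that ignores serial correlation: always predict the training mean.
mean_mse = float(np.mean((test - train.mean()) ** 2))

print(f"estimated phi: {phi_hat:.3f}")
print(f"AR(1) one-step MSE: {ar_mse:.3f}")
print(f"mean-baseline MSE: {mean_mse:.3f}")
```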