I don't really know what you mean by evaluation. But you need to be able to (faithfully) generate all the positions your system would take through time, and also to generate all the returns you would have made through time.
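To make "positions and returns through time" concrete, here is a minimal sketch in plain Python. It assumes one price per bar and a signal of +1 (long), -1 (short) or 0 (flat) decided at each bar; the `backtest` function and the sample numbers are hypothetical, not anything from your system. The position chosen at bar t earns the price change from t to t+1, so the signal never sees the price it is trading against.

```python
def backtest(prices, signals):
    """Return (positions, returns) for a signal-driven strategy.

    signals[t] is the position decided using data up to and including
    prices[t]; it earns the fractional price change from bar t to bar t+1.
    """
    if len(prices) != len(signals):
        raise ValueError("prices and signals must have the same length")
    # The last signal has no following bar, so it produces no return.
    positions = list(signals[:-1])
    returns = [
        pos * (prices[t + 1] - prices[t]) / prices[t]
        for t, pos in enumerate(positions)
    ]
    return positions, returns


# Made-up illustrative data, not real QM prices.
prices = [100.0, 102.0, 101.0, 103.0]
signals = [1, -1, 0, 1]
positions, returns = backtest(prices, signals)
for t, (pos, ret) in enumerate(zip(positions, returns)):
    print(f"bar {t}: position {pos:+d}, return {ret:+.4f}")
```

This ignores transaction costs, slippage and contract rolls, all of which a faithful backtest would need to account for.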
Aside from pure P&L, you should be looking at how much risk your system is taking, and under what conditions it's doing badly. All backtests are overfit: their use is mostly in identifying problems with your strategy, rather than predicting how much money you'll make.
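One common way to look at risk beyond P&L is maximum drawdown: the largest peak-to-trough fall in the equity curve. The sketch below assumes you already have a list of per-period fractional returns; the function names and the sample returns are hypothetical.

```python
def equity_curve(returns, start=1.0):
    """Compound per-period fractional returns into an equity curve."""
    curve = [start]
    for r in returns:
        curve.append(curve[-1] * (1.0 + r))
    return curve


def max_drawdown(curve):
    """Largest fractional fall from a running peak to a later value."""
    peak = curve[0]
    worst = 0.0
    for value in curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


# Made-up returns for illustration only.
returns = [0.10, -0.20, 0.05, -0.10, 0.30]
curve = equity_curve(returns)
mdd = max_drawdown(curve)
print(f"final equity {curve[-1]:.4f}, max drawdown {mdd:.1%}")
```

Here the strategy ends up ahead overall, yet at one point it was down roughly a quarter from its peak. That is the kind of thing a P&L total hides, and knowing when those falls happen tells you under what conditions the system does badly.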
One question you'd get asked if you were proposing this in a real trading environment is this: what is it about the QM emini contract that makes this work? Does it work for other energy contracts? For other commodities? For bonds, or equities? If not, why not?
Basically I have a dataset and I train my model with 70% and then evaluate its guesses against the remaining 30%. Hence a baseline is created and I can see if my model performs better.
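For reference, a 70/30 split on time-series data might look like the sketch below. It assumes the rows are already sorted oldest to newest and that the split is chronological rather than shuffled, so the evaluation set is strictly later than the training set; `chronological_split` is a hypothetical helper, not part of any AWS API.

```python
def chronological_split(rows, train_fraction=0.7):
    """Split time-ordered rows into (train, evaluation) without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Stand-in for ten time-ordered observations.
rows = list(range(10))
train, evaluation = chronological_split(rows)
print(train)       # the first 70% of the rows
print(evaluation)  # the remaining 30%, all later in time
```

Shuffling before splitting would let the model train on bars that come after the ones it is evaluated on, which tends to flatter the results.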
It took some doing to get this model to perform well. I did this by adding features that help recognize patterns in the time series data.
The features I created are not specific to QM, as they are technical (e.g. numbers, not news) and time-series related. So the models should work with any historical dataset with the same fields.
I feel like you're talking past me a little. The first thing you need to do is generate all the positions your system would have taken over as many years as possible, and figure out at what times you make and lose money. Otherwise you don't have a backtest.
I apologize. I can do that. I'm going to generate that backtest you described.
Right now I have residual data from the AWS machine learning output that tells me whether there is any structure to the times it guesses wrong. And a value below baseline is a better-than-50/50 guess, according to what I have learned about how AWS does its ML. Knowing that, I use this personally as a supporting indicator for my trade decisions. Since it's so new, and I really don't want people to think I'm scamming or something, I'm just releasing my results free for now, not trying to be a douche ;)
AWS defines the baseline as follows:
Baseline RMSE
Amazon ML provides a baseline metric for regression models. It is the RMSE for a hypothetical regression model that would always predict the mean of the target as the answer. For example, if you were predicting the age of a house buyer and the mean age for the observations in your training data was 35, the baseline model would always predict the answer as 35. You would compare your ML model against this baseline to validate if your ML model is better than a ML model that predicts this constant answer.
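The definition above can be sketched in a few lines of plain Python. This is only an illustration of the formula as quoted, not Amazon ML's own code; the function names and all the numbers (including the model predictions) are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def baseline_rmse(train_targets, eval_targets):
    """RMSE of a model that always predicts the training-set mean."""
    mean = sum(train_targets) / len(train_targets)
    return rmse(eval_targets, [mean] * len(eval_targets))


# Made-up ages, echoing the house-buyer example: the training mean is 35.
train_targets = [30.0, 35.0, 40.0]
eval_targets = [33.0, 37.0]
model_predictions = [34.0, 36.0]  # hypothetical model output

baseline = baseline_rmse(train_targets, eval_targets)
model = rmse(eval_targets, model_predictions)
print(f"baseline RMSE {baseline:.2f}, model RMSE {model:.2f}")
```

In this toy case the model's RMSE is lower than the baseline's, which is the comparison the quoted definition describes.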