| HN Mirror

by HilbertSpace 5391 days ago

"training/testing datasets which absolutely critical to obtaining any reasonable model"

This is partly correct but, in general, too strong.

Am I commenting on the OP? Not really!

Why too strong? Because it assumes too little: sometimes more information is available, and with that extra information a 'testing data set' may not be needed.

Why are 'testing data sets' important? If about all you have to go on is the 'historical data' and you are just searching for a 'model' based mostly on what 'fits' the data, then, sure, a 'testing data set' will likely be crucial. One way to get such a 'testing data set' is to partition the 'historical data' into two parts, use the first to 'fit' a model and the second to 'test' the fit. Of course, there are still risks: If we fit 10,000 models, find 10 that fit well, test each of the 10 with the 'testing data set', and accept the model that fits the testing data best, then we may still have some problems from a 'generalized version of overfitting'! As I recall, there has been some work in mathematical statistics to address this issue.
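A rough sketch of that selection risk, under assumptions that are mine and not from any real data: the labels are coin flips and every candidate 'model' is also a coin flip, so no model has any real skill. The function name `best_of_many` and all the sizes are made up for illustration.

```python
import random

def accuracy(predictions, labels):
    """Fraction of positions where the prediction equals the label."""
    hits = sum(p == y for p, y in zip(predictions, labels))
    return hits / len(labels)

def best_of_many(n_models=1000, n_test=50, n_fresh=5000, seed=0):
    """Pick the 'model' with the best test-set accuracy, then score it on fresh data.

    Every model here is a pure coin flip (an assumption for illustration),
    so its true accuracy is 0.5 no matter how good it looks on the test set.
    """
    rng = random.Random(seed)
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]
    best_score = max(
        accuracy([rng.randint(0, 1) for _ in range(n_test)], test_labels)
        for _ in range(n_models)
    )
    # The selected model is still a coin flip, so fresh data tells the truth.
    fresh_labels = [rng.randint(0, 1) for _ in range(n_fresh)]
    fresh_preds = [rng.randint(0, 1) for _ in range(n_fresh)]
    return best_score, accuracy(fresh_preds, fresh_labels)

if __name__ == "__main__":
    test_score, fresh_score = best_of_many()
    print(f"best test accuracy: {test_score:.2f}, fresh accuracy: {fresh_score:.2f}")
```

The winner's test score looks well above chance only because it was the best of many tries on the same testing data; on fresh data it falls back to about one half.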

Where can we get by without a 'testing data set'? Broadly, if we know more than the meager assumptions common in 'machine learning' or 'curve fitting'.

What more can be known? In principle the variety is large.

Examples? Sure: Broadly just simple, old 'regression analysis', looked at as statistical estimation, makes a long list of quite detailed assumptions. E.g., we assume that there is a model that works and that we know in good detail the form of that model. We assume a lot about the 'historical data' we have; e.g., we assume 'homoscedasticity' and mean zero, independent and identically distributed (i.i.d.) Gaussian errors. We make some assumptions about dimensionality (e.g., to get around 'overfitting'). Then the usual derivations give minimum variance, unbiased estimates of the unknown parameters and more, all without any use of 'testing data'. "Look Ma, no testing data required!".
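To make the regression case concrete, here is a minimal sketch for the one-predictor model y = a + b*x. The function name `simple_regression` and the data are hypothetical; the formulas are the usual textbook closed forms, and the standard error means something only if the assumptions above actually hold.

```python
import math

def simple_regression(x, y):
    """Least-squares fit of y = a + b*x, plus the textbook standard error of b.

    The standard error is only meaningful under the classical assumptions
    (correct linear form, i.i.d. mean-zero Gaussian errors, constant variance).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope estimate
    a = mean_y - b * mean_x            # intercept estimate
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)                 # unbiased estimate of the error variance
    se_b = math.sqrt(s2 / sxx)         # standard error of the slope
    return a, b, se_b

if __name__ == "__main__":
    # Made-up data, for illustration only.
    a, b, se_b = simple_regression([0, 1, 2, 3, 4], [0.1, 0.9, 2.1, 2.9, 4.1])
    print(f"a = {a:.3f}, b = {b:.3f}, se(b) = {se_b:.3f}")
```

All of the data goes into the fit; the measure of uncertainty comes from the assumptions, not from held-out data.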

"Yes, son, but as your father kept telling you, a LOT of assumptions are required, and the assumptions are not all easy to verify. Or the regression derivations are a nice logical trip from island A to island B we would like to get to but we don't always know how to get to island A.".

Other examples? Sure: Calculate the trajectory of a spacecraft doing 'slingshots' in the inner solar system and then reaching, say, Saturn. We start with Newton's second law, his law of gravity, maybe a little about the solar wind, a lot of details about the orbits of the planets, and do some good numerical work with an initial value problem of an ordinary differential equation. We build a 'model' but don't really 'fit for parameters' or use 'historical data' and have no real use for 'testing data'. Why? Because we believe in Newton's laws and our numerical work. A 'model'? Yes. Fitting 'parameters'? No.
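A toy version of that initial value problem, nowhere near a real mission calculation: one small body around one fixed central mass, in made-up units with GM = 1, integrated with the classical fourth-order Runge-Kutta method. The names `derivative`, `rk4_step`, and `propagate` are mine, and the choice of RK4 is just a convenient assumption.

```python
import math

GM = 1.0  # gravitational parameter in made-up units (an assumption)

def derivative(state):
    """Newton's second law with inverse-square gravity from a fixed central mass."""
    x, y, vx, vy = state
    r3 = (x * x + y * y) ** 1.5
    return (vx, vy, -GM * x / r3, -GM * y / r3)

def rk4_step(state, dt):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = derivative(state)
    k2 = derivative(shifted(state, k1, dt / 2))
    k3 = derivative(shifted(state, k2, dt / 2))
    k4 = derivative(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def propagate(state, t_end, steps):
    """Integrate the initial value problem from t = 0 to t = t_end."""
    dt = t_end / steps
    for _ in range(steps):
        state = rk4_step(state, dt)
    return state

if __name__ == "__main__":
    # Circular orbit of radius 1 and speed 1; with GM = 1 the period is 2*pi.
    start = (1.0, 0.0, 0.0, 1.0)
    print(propagate(start, 2 * math.pi, 1000))
```

Nothing is fitted here. The check on the work is physics we already know: a circular orbit should come back to where it started after one period.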

Can there be a connection between spacecraft trajectories and economic models? Sure: Bring more assumptions than just curve fitting. An example is to bring, essentially, accounting. Then we can get a Leontief input/output model. We bring basically just accounting data and not other historical data, do no real 'parameter' estimation, and use no 'testing' data. If the input data is noisy, then, sure, so will be the output, and we might do some work with confidence intervals. Still we don't check with 'testing data'.
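For the Leontief case, a minimal sketch: if a[i][j] is the amount of sector i's output used up per unit of sector j's output and d is final demand, then gross output x satisfies x = a x + d, that is, (I - a) x = d. The two-sector coefficients and demand below are invented for illustration, and the function names are mine.

```python
def solve_linear(m, b):
    """Solve m x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(m, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def leontief_output(a, demand):
    """Gross output x satisfying x = a x + demand, i.e. (I - a) x = demand."""
    n = len(demand)
    i_minus_a = [
        [(1.0 if i == j else 0.0) - a[i][j] for j in range(n)]
        for i in range(n)
    ]
    return solve_linear(i_minus_a, demand)

if __name__ == "__main__":
    # Hypothetical two-sector economy, numbers invented for illustration.
    coefficients = [[0.2, 0.3], [0.4, 0.1]]
    final_demand = [60.0, 60.0]
    print(leontief_output(coefficients, final_demand))
```

With these made-up numbers each sector has to produce about 120 units to deliver 60 units of final demand. The coefficients come straight from the accounting, so there is nothing to fit and nothing to hold out.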

More examples? Sure: The broad field, with many techniques, of distribution-free statistical hypothesis testing is based on historical data and some assumptions and really needs no testing data. What is obtained is much like a 'model' where we can plug in new data and get the intended results. The assumptions are typically that the data is i.i.d.
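One simple member of that family is the sign test, picked here only as an illustration. A minimal sketch, assuming i.i.d. observations from a continuous distribution; the function name `sign_test_p_value` and the sample are hypothetical.

```python
from math import comb

def sign_test_p_value(sample, median0):
    """Two-sided exact sign test of H0: the population median equals median0.

    Assumes only that the observations are i.i.d. from a continuous
    distribution; observations equal to median0 are dropped.
    """
    above = sum(v > median0 for v in sample)
    below = sum(v < median0 for v in sample)
    n = above + below
    k = min(above, below)
    # Under H0 the count of values above median0 is Binomial(n, 1/2),
    # whatever the shape of the underlying distribution.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

if __name__ == "__main__":
    # Made-up sample, for illustration only.
    data = [1.2, 0.7, 2.5, 1.9, 3.1, 0.4, 2.2, 1.5, 2.8, 1.1]
    print(sign_test_p_value(data, 0.0))
```

The null distribution follows from the i.i.d. assumption alone, so the p-value is exact without any fitted parameters or any 'testing data'.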

Net, a lot can be done beyond the common approach of machine learning curve fitting.

1 comments

dvse 5391 days ago

As usual, an excellent summary. Economic models based on low level data (essentially better "instrumentation", capturing bank transactions, some contracts, individual spending etc.) might be quite useful at least for short term prediction. Perhaps old ideas of "optimal control" can be to some extent realised.
