|
Okay, I started watching your video lectures at your 15 City Sense Data and through your application of maximum likelihood estimation. So, you drag out some common distributions, e.g., Poisson, negative binomial, beta, and look for the best "fit" for your data. Here you have little or no reason to believe that the data has any of those distributions. So, the work is all rules of thumb, intuitive heuristics, crude considerations, e.g., discrete or continuous and what the "support" is, etc. Then the quality is added on later -- see if the results are good on the test data. If not, then try again with more heuristics. If so, then use the "model" in practice. Cute. Apparently you are calling this two step technique, (i) use heuristics and then (ii) validate with the training data the influence of "machine learning". Okay. But there is a more traditional approach: When we use, e.g., the Poisson distribution, it is because we have good reason to believe that the data really is Poisson distributed. Maybe we just estimate the one parameter of the distribution and then continue on. It will be good also to check with some test data, but really, logically it is not necessary. Moreover, we may have some predictions for which we have no test data. An example of this traditional approach where we have no test data and, really, no "training" data in the sense you are assuming, is the work I outlined for predicting the lifetime of submarines in a special scenario of global nuclear war limited to sea. There we got a Poisson process from some work of Koopmans. But we had no data on sinking submarines in real wars, not for "training" or "testing". We can similar things for, say, arrivals at a Web site because from the renewal theorem we have good reason to believe that the arrivals, over, say, 10 minutes at a time, form a Poisson process. Then we could estimate the arrival rate and, if we wish, do an hypothesis test for anomalies where the null hypothesis was that the arrivals were as in our Poisson process, all without any heuristic fitting and detecting anomalies that were not in any data we had observed so far. So, generally we have two approaches: (1) Machine learning where we get test data, use heuristics to build a model, test the model on test data, and declare success when we pass the test. (2) Math assumptions where we have reason to know the distributions and form of a model, and use possibly much less data to determine the relatively few free parameters in the model and then just apply the model (testing also if we have some suitable test data -- in the submarine model, we did not). A problem for approach (1) machine learning is that we should be fairly sure that when we apply the model at point x, point x is distributed as in the training data. So, essentially we are just interpolating. E.g., if we have positive integers n and d, the set of real numbers R, training data pairs (y_i, x_i) for i = 1, 2, ..., n where y_i in R and x_i in R^d, fit a model that predicts each y_i from each_x_i, then to apply the model at x in R^d, we should know that x is distributed like the x_i. So, we are assuming that the x_i have a "distribution", that is, are independent, identically distributed (i.i.d.) "samples" from some distribution. And our model is a function f: R^d --> R. Then to apply our model a x in R^d we evaluate f(x). We hope that our f(x) will be a good approximation of the y in R that would correspond to our x in R^d. So, again, to do this, we should check if x in R^d is distributed like the x_i in R^d. I did see your work on anomaly detection, but it was much simpler and really not in the same context as an hypothesis test with null hypothesis that the x and x_i are i.i.d. in R^d. I can believe that at times approach (1) can work and be useful in practice, but (A) a lot of data is needed (e.g., can't apply this approach to the problem of the submarine lifetimes) and (B) are limited in the values of x that can be used in the model (can't do an hypothesis test to detect anomalous arrivals beyond any seen before at a Web site). From all I could see in your lectures, you neglected approach (2); readers should be informed that something like (2) is missing. |