Hacker News
by graycat 2904 days ago
I looked over the whole list of topics. They have a page of questions about prerequisites, and for all the questions I was able to answer that I was fully comfortable with the question. But the questions have to do with probability and optimization, and my applied math Ph.D. was in those fields with research in stochastic optimal control.

I can begin to see some of why Bloomberg is interested in the material: Maybe with all the economic and stock market data they collect, they can build some surprisingly useful predictive machine learning models. Okay. Maybe one good result will be that Michael Bloomberg will give some more money to Johns Hopkins!

But otherwise I was disappointed:

(1) Put simply and bluntly, it's all about just one now very old topic -- regression analysis. Well, when old regression analysis doesn't fit very well, then try logistic regression, ridge regression, regression trees, other forms of regression, other forms of curve fitting, e.g., neural networks.

Or, look, guys, it's all just empirical curve fitting. Ptolemy tried empirical curve fitting. He used his epicycles. They didn't fit very well. Then, later, from work of Kepler, a falling apple, etc., Newton guessed that (A) there was a law of motion due to force equals mass times acceleration and (B) there was a force directly proportional to the product of two masses and inversely proportional to the square of the distance between them. That fit the data great!

Lesson: Lots of things change continuously. Usually the changes are differentiable, that is, have a well defined tangent. Well, that tangent is linear. So at least locally, lots of things are linear. So, for lots of things, a promising first cut approach is to do a linear fit. For more, neural networks can approximate anything continuous, and lots of things are continuous. So, net, curve fitting has some promise and utility. But, still, curve fitting is just guessing without any real basis in science, e.g., couldn't find or replace what Newton did.

Really, the now classic texts in regression analysis start with something the machine learning curve fitting does not: The classic texts assume that there really is a linear equation, that we would have the equation exactly except for some errors in the data, and then do some nice applied math to show how to get the errors down and get a good approximation to the equation that has already been assumed to exist. The assumptions can vary, stronger or weaker, but in the theory there was little or no role for just empirical fitting. Well, machine learning is charging ahead without that assumption that a linear equation exists. And, wonder of wonders, apparently often now such an equation, even as a good approximation, doesn't exist. So, we are back to empirical curve fitting, struggling like Ptolemy. We already know: Some successes are possible, but like Ptolemy we face some severe limitations.

(2) Okay, when simple regression doesn't fit very well, we keep trying? Okay. Say, we try logistic regression, ridge regression, L1 or L2 regularization regression, regression trees, boosting, ..., neural networks, etc. Uh, which is better, L1 regularization or L2 regularization? Uh, this sounds like throwing stuff against the wall until something appears to stick. Sure, that can work at times, but are we really satisfied with that? Wouldn't we want some more solid reasons for the tool we pick? Lots of places elsewhere in applied math, applied probability, and applied statistics we do have solid reasons.

With so many efforts to patch up regression, maybe we might suspect that for a lot of problems regression is not the right tool?

(3) There is a lot more to applied statistics, applied probability, etc. and possibly of value for applications, maybe including Bloomberg's customers, than empirical curve fitting. Commonly this work has equal justification to be called machine learning because it also takes in data, estimates some parameters, builds a model, and gives results. Moreover, the better cases work with clear assumptions so that when the assumptions hold we know we have something solid, and some of the methods use meager assumptions that are relatively realistic in practice. The Bloomberg course is just empirical curve fitting, nearly all versions of regression analysis, and omits all the rest. On the applied math shelves of the research libraries, regression is only a tiny fraction of the whole. Where's the rest?

3 comments

Nice comment, echoes a lot of my feelings about ML. I have a question. You write

> (2) Okay, when simple regression doesn't fit very well, we keep trying? Okay. Say, we try logistic regression, ridge regression, L1 or L2 regularization regression, regression trees, boosting, ..., neural networks, etc. Uh, which is better, L1 regularization or L2 regularization? Uh, this sounds like throwing stuff against the wall until something appears to stick. Sure, that can work at times, but are we really satisfied with that? Wouldn't we want some more solid reasons for the tool we pick? Lots of places elsewhere in applied math, applied probability, and applied statistics we do have solid reasons.

I'm wondering what these "solid reasons" could be? Some sort of experience based on past data? An example would be helpful.

Let's see: Pick a system that we know to be linear. Get a lot of pairs of real inputs and outputs and find the coefficients of the linear function that relates the inputs to the outputs. Then given a new input, can say what the output would be.

Okay, the function, system between a violin on the stage at Carnegie Hall and a seat near the roof is linear. So, I'm typing quickly here, do regression with a recording at the stage and at the seat and look for the coefficients in the convolution. Then given an oboe, can say what it will sound like in the seat.

Hooke's law with small deflections is linear. So, take as independent variables the forces on a space frame and as dependent variables the deflections and estimate all the spring stiffness values.

Take 200 recipes for tomato sauce all from the same 10 ingredients, for each recipe measure the weight of each ingredient and the weight of the protein in the final sauce and estimate the protein in each of the 10 ingredients. Then for any new tomato sauce recipe, weigh the ingredients and get the protein in the final sauce.
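
The tomato sauce example can be sketched in a few lines. This is only a minimal sketch under loud assumptions: 3 ingredients instead of 10, made-up protein fractions in `true_protein`, simulated recipe weights, and small Gaussian measurement error. The helper `fit_least_squares` is a hypothetical name; it solves the usual normal equations by Gaussian elimination.

```python
import random

def fit_least_squares(rows, targets):
    """Solve the normal equations (A^T A) b = A^T y by Gauss-Jordan elimination."""
    k = len(rows[0])
    # Build the augmented matrix [A^T A | A^T y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, targets))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]

random.seed(0)
true_protein = [0.012, 0.030, 0.200]   # hypothetical grams of protein per gram
# 200 simulated recipes: the weight in grams of each ingredient.
recipes = [[random.uniform(50, 500) for _ in true_protein] for _ in range(200)]
# Measured protein in each final sauce, with assumed measurement error.
measured = [sum(w * p for w, p in zip(r, true_protein)) + random.gauss(0, 0.5)
            for r in recipes]

estimated = fit_least_squares(recipes, measured)

# Predict the protein in a new recipe from the ingredient weights alone.
new_recipe = [300.0, 120.0, 80.0]
predicted = sum(w * p for w, p in zip(new_recipe, estimated))
print(estimated, predicted)
```

The fit works here because the system really is linear with no intercept: protein in the sauce is the sum of the protein contributed by each ingredient.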

> I'm wondering what these "solid reasons" could be?

For the classic assumptions for regression, we know that a linear function exists, our data has additive errors that have mean zero, constant variance, and are Gaussian.

Then we can get an F ratio test on the regression, t-tests on the coefficients, and confidence intervals on the predicted values. This is just the standard stuff in the classic derivations in several of the books I listed.

And with some weaker assumptions, there are still some such results we can get.

This is just for regression.

There's a lot more in applied statistics, that is, where we make some assumptions that seem accurate enough in practice, get some theorems, and benefit from the conclusions of the theorems.

For an example, also in this thread I wrote about estimating how long some submarines would last. The assumptions were explicit and, on a nice day, somewhat credible. If swallow the assumptions, then have to take the conclusions seriously.

I've done other problems in applied statistics -- have a problem, collect some data, see what assumptions can make, from the assumptions, have some theorems, from the theorems have some conclusions powerful for the problem. If swallow the assumptions, trust the software, ..., then have to take the conclusions seriously. So, get to check the software and argue about the assumptions.

For machine learning as in the Bloomberg course, have some training data and some test, validation data. Fit to the training data. Check with the test data. Assume that the real world situation doesn't change, apply the model, and smile on the way to the bank.

I guess, okay when it works. But:

(A) How many valuable real applications are there? I.e., regression has been around with solid software for decades, and I've yet to see the yachts of the regression practitioners -- maybe there are some now but what is new might mostly be hype. Or the problem was that the SAS founders were not very good at sales? Same for SPSS (now pushed by IBM), Matlab, R, etc.?

(B) Wouldn't we also want confidence intervals on the predictions? Okay, maybe there are some resampling/bootstrap ways to get those, maybe.

(C) There's more to applied statistics than empirical curve fitting, and commonly there we can have "solid reasons". For more on applied statistics, what I've done is only a drop in the bucket, ocean, but the research libraries are awash in examples.
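
On point (B), one resampling approach is the percentile bootstrap. The sketch below is illustrative only: the data are simulated from an assumed line y = 2 + 3x with Gaussian noise, `fit_line` is a hypothetical helper, and the interval is for the fitted mean at `x_new`, not for a single new observation (that would need the residual noise added back in).

```python
import random

def fit_line(pairs):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    b = sxy / sxx
    return my - b * mx, b

random.seed(1)
# Hypothetical data: y = 2 + 3x plus Gaussian noise.
data = [(x, 2.0 + 3.0 * x + random.gauss(0, 1.0))
        for x in (random.uniform(0, 10) for _ in range(100))]

x_new = 5.0
fits = []
for _ in range(2000):
    resample = [random.choice(data) for _ in data]   # sample with replacement
    a, b = fit_line(resample)
    fits.append(a + b * x_new)

# Percentile interval: the middle 95% of the bootstrap fitted values.
fits.sort()
lower, upper = fits[int(0.025 * len(fits))], fits[int(0.975 * len(fits)) - 1]
print(lower, upper)
```

The only assumption the bootstrap itself leans on is that the observed pairs are a fair sample of what will be seen later, which is the same assumption the train/test procedure makes.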

Graycat and I have a history of conversations on HN regarding the differences between statistics and machine learning approaches.

ML is a term that came from Russian probabilists and statisticians. ML is applied probability, optimization and algorithms. A traditional statistician will be strong in the former but not necessarily in the other two.

One can of course take the 'what sticks to the wall' approach, but it would be a mischaracterization to state that a principled approach does not exist in the ML literature. There are theorems stating when to try one over the other and then what to expect.

The difference in stats and ML can be seen in the nature of the theorems in their body of literature.

Let's take the probabilistic setting of ML, because this matches the assumptions a sophisticated statistician would make. ML also has techniques where you don't need a stochastic model; those settings are similar to adversarial game theoretic fundamentals that do not involve a stochastic sampling process.

The assumptions:

We have a joint distribution over random variables X,Y and have drawn a sample of size n from it. We have another sample from the same joint distribution but the Ys have been removed and we want to get those back.

The baby statistician's way:

God came to me in my dream and told me everything about a function f(X, \theta) that would recover the ys. He conveyed an infinite amount of information but left out a finite number of parameters (settings of knobs) for me to figure out. I was told that the specific \theta lives in the set \Theta.

The mature statistician's way:

Almost the same as above except that \theta would be living in an infinite dimensional space. In other words he can work with a less powerful god who leaves out an infinite amount of information.

The theorems: A statistician would be interested in theorems where they show that their method manages to recover the \theta that god left out.

In other words they would like to find an estimator \hat{\theta} that converges to the true \theta as the size of the initial sample grows to infinity. The crowning accomplishments are when these theorems prove that this happens with probability 1. Weaker ones are those that state that this happens with as high a probability as one wants (but never quite reaching 1).

ML way:

A machine learner is not interested in theorems that characterize how close \hat{\theta} is to true \theta. Furthermore asymptotically we are all dead, so he is interested in finite sample results. His point would be that all this god mumbo-jumbo and \theta is a piece of fiction we invented anyway. No one will ever be able to see these, even in principle, so he couldn't care less about those. He would care about theorems that show how the function value (and not its parameters) converges uniformly to the best function in a class of functions \cal F_n of his choosing and using finitely many samples. He would also be interested in finding a sequence of function classes so that one can consider richer and richer supersets of function classes as he gets more and more data. The regularization business falls out of the function classes he wants to look at. A closely related statistical viewpoint would be that of sieves.

This is not an entirely correct characterization because there existed (and exist) probabilists and statisticians who were interested in prediction theorems rather than parameter recovery theorems, but they were far, far fewer in number. An example would be Prof. Dawid.

Nice comment, "the now classic texts in regression analysis start with something the machine learning curve fitting does not...". What classic texts would you recommend starting with?

Part I

A relatively good, introductory text on statistics, starting without either calculus or linear algebra, with nice progress into experimental design and analysis of variance, that is, multivariate statistics with discrete data, from some serious experts in that field, from University of Iowa, for some serious reasons, how to maximize corn yields considering soil chemistry, water, seed variety, plowing techniques, fertilizer, etc., long commonly used for undergraduates in the social sciences, e.g., educational statistics, agriculture, etc. is (with my mark-up for TeX):

George W.\ Snedecor and William G.\ Cochran, {\it Statistical Methods, Sixth Edition.\/}

A good, first, no toy, text on regression, with minimal prerequisites:

N.\ R.\ Draper and H.\ Smith, {\it Applied Regression Analysis.\/}

Some good books on regression and its usual generalizations, e.g., IIRC, factor analysis, i.e., principal components, discriminant analysis, etc.:

Maurice M.\ Tatsuoka, {\it Multivariate Analysis: Techniques for Educational and Psychological Research.\/}

Donald F.\ Morrison, {\it Multivariate Statistical Methods: Second Edition.\/}

William W.\ Cooley and Paul R.\ Lohnes, {\it Multivariate Data Analysis.\/}

A mathematically relatively serious text on regression:

C.\ Radhakrishna Rao, {\it Linear Statistical Inference and Its Applications:\ \ Second Edition.\/}

For multivariate statistics with discrete data, consider

George W.\ Snedecor and William G.\ Cochran, {\it Statistical Methods, Sixth Edition.\/}

Stephen E.\ Fienberg, {\it The Analysis of Cross-Classified Data.\/}

Yvonne M.\ M.\ Bishop, Stephen E.\ Fienberg, Paul W.\ Holland, {\it Discrete Multivariate Analysis:\ \ Theory and Practice.\/}

Shelby J.\ Haberman, {\it Analysis of Qualitative Data, Volume 1, Introductory Topics.\/}

Shelby J.\ Haberman, {\it Analysis of Qualitative Data, Volume 2, New Developments.\/}

The classic on the math of analysis of variance:

Henry Scheff\'e, {\it Analysis of Variance.\/}

Broadly, in all of this, we are trying to analyze data on, call it, several variables, make predictions, etc.

In all of this, and as in the title

{\it The Analysis of Cross-Classified Data.\/}

above, suppose we have in mind random variables Y and X.

What is a random variable? Go outside. Measure something. Call that the value of random variable Y. What you measured was one value of possibly many that you might have measured. Considering all those possible values, there is a cumulative distribution F_Y that, for any real number y, gives the probability that random variable Y is <= real number y:

P(Y <= y) = F_Y(y).

So, F_Y(y) is defined for all real numbers y, is at 0 at the limit of y at minus infinity and at 1 at the limit of y at plus infinity. So, as we move real number y from left to right, F_Y(y) increases (or at least never decreases) -- monotonically. On a nice day, function F_Y is differentiable, and the derivative from calculus,

f_Y(y) = d/dy F_Y(y)

is the probability density of real random variable Y.

Here's the standard way to discover something about Y, in particular about its cumulative distribution F_Y:

We can imagine having random variables Y_1, Y_2, ... that are, in the sense of probability, independent of Y and that have the same cumulative distribution as Y. Then for positive integer n, and for real number y, by the law of large numbers (the weak version has an easy proof), in the limit as n grows large, as accurately as we please, the fraction of the values

Y_1, Y_2, ...,Y_n

that are <= y

is F_Y(y). So, via such simple random sampling, for any real number y we can estimate F_Y(y), the cumulative distribution of Y.
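
A quick simulation shows this estimate at work. It is a sketch only: it assumes, purely for illustration, that Y is exponential with rate 1, so that the true F_Y(y) = 1 - exp(-y) is known and can be compared against. The function name `empirical_cdf` is a label chosen here.

```python
import math
import random

def empirical_cdf(samples, y):
    """Fraction of the samples that are <= y."""
    return sum(1 for s in samples if s <= y) / len(samples)

random.seed(2)
y = 1.0
true_value = 1.0 - math.exp(-y)   # F_Y(1) under the assumed exponential law
estimates = {}
for n in (100, 10_000, 100_000):
    # Independent draws Y_1, ..., Y_n with the same distribution as Y.
    samples = [random.expovariate(1.0) for _ in range(n)]
    estimates[n] = empirical_cdf(samples, y)
    print(n, estimates[n], true_value)
```

As n grows, the printed fraction should settle near the true value, which is what the law of large numbers promises.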

For a little more, under meager assumptions that hold nearly universally in practice, if we take the ordinary grade school average of

Y_1, Y_2, ..., Y_n

as n increases we will approximate the average or expected value of Y denoted by E[Y].

To define the expected value, we can use some calculus and the cumulative distribution of F_Y, but for now let's just use our intuition about averages and move along.

Now suppose we are also given random variable X. Maybe the values of X are just real numbers, some 10 real numbers, 20 values from the set {1, 2, 3}, the NYSE price of Microsoft 100 times a second for the last three weeks, or full details on the atmosphere of earth every microsecond for the past 5 billion years. That is, for the values of X we can accept a lot of generality. Still more generality is possible, but that would take us on a detour for a while.

For our point here, let's assume that X takes on just discrete values or we have just rounded off the values and forced them to be discrete. In practice we will have only finitely many discrete values.

Now we want to use X to predict Y.

So, sure, much of machine learning is to construct a model, maybe with regression trees or neural networks, to make this prediction, but here we will show a simpler way that is always the most accurate possible whenever we have enough data. How 'bout that!

This simpler way is just old cross tabulation.

Net, over a wide range of real cases trying to predict Y from X, we should just use cross tabulation unless we don't have enough data. Or, the main reason for just empirical curve fitting using regression linear models or neural network continuous models is that we don't have enough data just to use cross tabulation.

For a preview of a coming attraction, will notice in nearly all of regression and neural networks big concerns about over fitting. Well, cross tabulation doesn't have that problem. How 'bout that!

Errata: "We can imagine having random variables Y_1, Y_2, ... that are, in the sense of probability, independent of Y and that have the same cumulative distribution as Y."

should read

"We can imagine having random variables Y_1, Y_2, ... that are, in the sense of probability, independent and that have the same cumulative distribution as Y."

"To define the expected value, we can use some calculus and the cumulative distribution of F_Y, but for now let's just use our intuition about averages and move along."

should read

"To define the expected value, we can use some calculus and the cumulative distribution F_Y, but for now let's just use our intuition about averages and move along."

Part II

So, let's move on to cross tabulation:

Let real number x be some value of random variable X. Suppose we take the cumulative distribution of Y when X = x (X is a random variable that might take on many different values; x is just some real number). E.g., maybe when X = x = 3 (X is a random variable; x is just the real number 3) we find the conditional probability that Y <= y given that X = x, that is, we find P(Y <= y|X = x) = F_{Y|X}(y) (Y is a random variable; y is just some real number; Y|X is a subscript).

Or we imagine a thick chocolate cake with one edge the Y axis and another edge the X axis. The surface of the cake is the joint probability density of random variables X and Y. At X = x, we cut the cake parallel to the Y axis and perpendicular to the X axis. We look at the cut surface -- that's essentially the conditional probability density of Y given X = x.

Note: In an advanced course in probability, we can show that it makes sense to consider the density at individual values of x one at a time -- the key to doing so is the subject measure theory used as the foundation for probability theory as in the 1933 paper of A. Kolmogorov.

Well, with that cut surface, we take its area and divide by it so that we have scaled the cut area to have area 1 so that the cut surface is a probability density -- we are being really intuitive here but essentially correct.

We now have the conditional probability density of random variable Y given random variable X when X = x for real number x.

Or, suppose Y is height and X is weight. With X = 110, we get the probability density distribution of height for people who weigh 110.

So, we are on the way to using weight to predict height -- no joke, this is close to fully seriously the most accurate way to proceed in the context.

If X is not just a number but two numbers, weight and gender, say, 0 for female and 1 for male, then, with weight 110 and gender female, we get the distribution of height for 110 pound females. If X has three components, with the third 1 for cheerleaders and 0 otherwise, then we can predict height given weight of female cheerleaders. So, we are into predicting one random variable, height, from three components, weight, gender, and cheerleader or not, of random variable X. Sure, the component weight is also a random variable, and similarly for gender and cheerleader or not. So, if you will, we are predicting random variable Y height from three random variable or three components of the one random variable X -- your choice.

Since we can get a conditional distribution, we can take the expectation of the distribution and get a conditional expectation. So, the conditional expectation of Y given that X = x is written E[Y|X = x]. This is correct, but with details we could spend an hour here so let's just move along.

Well, the conditional expectation

E[Y|X = x]

is a function of the real number x. We can also regard it as more simply a function of the random variable X and write this as E[Y|X], the conditional expectation of random variable Y given (the values of) random variable X. So, for some function f with domain the values of X, we have f(X) = E[Y|X]. So, under meager assumptions that essentially always hold in practice f(X) is also a random variable.

It turns out, nicely enough, that the expectation of E[Y|X] is the expectation of Y itself;

E[E[Y|X]] = E[f(X)] = E[Y]
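
This identity can be checked exactly on a small discrete example. The joint probabilities in `joint` below are made up for illustration; exact fractions are used so that the two sides agree with no rounding.

```python
from fractions import Fraction

# Hypothetical joint probabilities P(X = x, Y = y) on a small grid.
joint = {
    (0, 1): Fraction(1, 8), (0, 3): Fraction(1, 8),
    (1, 2): Fraction(1, 4), (1, 6): Fraction(1, 4),
    (2, 5): Fraction(1, 4),
}

# Marginal distribution of X.
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p

# f(x) = E[Y|X = x] = sum over y of y * P(X = x, Y = y) / P(X = x)
f = {x: sum(y * p for (xx, y), p in joint.items() if xx == x) / p_x[x]
     for x in p_x}

e_y = sum(y * p for (_, y), p in joint.items())   # E[Y]
e_f_x = sum(f[x] * p_x[x] for x in p_x)           # E[f(X)] = E[E[Y|X]]
print(f, e_y, e_f_x)
```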

We want to show that f(X) = E[Y|X] is the least squares (minimum squared error) approximation of Y possible from X. So f(X) = E[Y|X] is the best predictor of Y, better than anything we could hope to do with regression, neural networks, anything linear, anything nonlinear, essentially anything at all.

For the proof, we start with something simple: We show that a = E[Y] minimizes E[(Y - a)^2]. That is, we are finding the single number a that best approximates real random variable Y in the least squares sense. Well, we have

E[(Y - a)^2]

= E[Y^2 - 2 Ya + a^2]

= E[Y^2] - 2aE[Y] + a^2

= E[Y^2] + E[Y]^2 - 2aE[Y] + a^2 - E[Y]^2

= E[Y^2] + (E[Y] - a)^2 - E[Y]^2

which we minimize with E[Y] = a.
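
The same algebra holds for a sample average in place of the expectation, so it can be checked numerically. In this sketch the Gaussian data are simulated and the name `mean_squared_deviation` is chosen here; the point is that moving a away from the mean by any shift raises the mean squared deviation by exactly shift^2.

```python
import random

def mean_squared_deviation(values, a):
    """Sample version of E[(Y - a)^2]."""
    return sum((v - a) ** 2 for v in values) / len(values)

random.seed(3)
values = [random.gauss(10.0, 2.0) for _ in range(1000)]   # assumed data
mean = sum(values) / len(values)

best = mean_squared_deviation(values, mean)
for shift in (-1.0, -0.1, 0.1, 1.0):
    shifted = mean_squared_deviation(values, mean + shift)
    # The excess over the minimum should be shift^2, up to rounding.
    print(shift, shifted - best)
```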

Or, for one interpretation, the minimum rotational moment of inertia is for rotation about the center of mass.

So, for our main concern, suppose we want to use the data we have X to approximate Y. So, we want real valued function g with domain the possible values of X so that g(X) approximates Y.

For the most accurate approximation, we want to minimize

E[(Y - g(X))^2]

Claim: For g(X) we want

g(X) = E[Y|X]

So, g is the best non-linear least squares approximation to Y.

Proof:

We start by using one of the properties of conditional expectation and then continue with just simple algebra much as we did for E[(Y - a)^2] just above:

E[(Y - g(X))^2]

= E[ E[Y^2 - 2Yg(X) + g(X)^2|X] ]

= E[ E[Y^2|X] - 2g(X)E[Y|X] + g(X)^2 ]

= E[ E[Y^2|X] + E[Y|X]^2 - 2g(X)E[Y|X] + g(X)^2 - E[Y|X]^2 ]

= E[ E[Y^2|X] + (E[Y|X] - g(X))^2 - E[Y|X]^2 ]

which is minimized with

g(X) = E[Y|X]

Done.

And how do we use this?

We just discretize X and use cross tabulation which directly estimates E[Y|X]. To be more clear, given real number x, to predict Y when X = x, we use cross tabulation to estimate E[Y|X = x] = f(x).

The good news is what we have proven: We have made the best prediction in the least squares sense of Y possible from our data on X.
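
A small simulation of cross tabulation as a predictor, as a sketch under stated assumptions: X takes the five values 0..4, the true E[Y|X = x] is (x - 2)^2, and the noise is Gaussian. All of that is invented for illustration. A straight-line least squares fit is included only for comparison.

```python
import random
from collections import defaultdict

random.seed(4)

# Hypothetical data: X takes values 0..4 and E[Y|X = x] = (x - 2)^2,
# which no straight line in x can match.
def draw():
    x = random.randrange(5)
    return x, (x - 2) ** 2 + random.gauss(0, 0.5)

train = [draw() for _ in range(5000)]
test = [draw() for _ in range(5000)]

# Cross tabulation: average Y within each cell of X.
sums, counts = defaultdict(float), defaultdict(int)
for x, y in train:
    sums[x] += y
    counts[x] += 1
crosstab = {x: sums[x] / counts[x] for x in counts}

# Straight-line least squares fit, for comparison.
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
slope = (sum((x - mx) * (y - my) for x, y in train)
         / sum((x - mx) ** 2 for x, _ in train))
intercept = my - slope * mx

def mse(predict):
    """Mean squared prediction error on the held-out data."""
    return sum((y - predict(x)) ** 2 for x, y in test) / len(test)

mse_crosstab = mse(lambda x: crosstab[x])
mse_line = mse(lambda x: intercept + slope * x)
print(crosstab, mse_crosstab, mse_line)
```

With plenty of data in every cell, the cell averages land close to the true conditional means and the held-out error is close to the noise variance, which is the floor no predictor can beat.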

The bad news is that the amount of data we need grows exponentially in the number of different components in X. I.e., we considered three variables in X, weight, gender, and cheerleader. The amount of data we need grows as the number of discrete values of X which grows exponentially in the number of components in X.
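
The growth is easy to tabulate. In this sketch every component is assumed to be discretized to the same number of levels, and the figure of 30 observations per cell is an arbitrary illustrative choice, not something from the discussion above.

```python
def cells(levels_per_component, components):
    """Number of cross tabulation cells when every component of X
    has the same number of discrete levels."""
    return levels_per_component ** components

observations_per_cell = 30   # assumed for illustration only
for k in (3, 14, 30):
    total_cells = cells(10, k)
    print(k, total_cells, total_cells * observations_per_cell)
```

With 10 levels per component, 3 components give 1,000 cells, while 14 components already give 10^14 cells.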

If somehow we KNOW that Y is a linear function of the components of X, then we might consider using regression. If we are doing just empirical curve fitting and have enough data, which in practice likely means relatively few components in X, then we can consider using just cross tabulation. But if X has thousands of components, then the exponential explosion in the amount of data required on X likely means that we will have to set aside cross tabulation, look for a way to use, maybe, a few dozen components of X, or go ahead and use the machine learning material in the Bloomberg course.

Uh, if we are trying to predict people, there is an old hint -- people are no more than about 14 dimensional beings. That is, given 1000 variables on each of 10 million people, using principal components we should be able to reproduce the 1000 variables quite accurately using just 14 principal components obtained from the 1000 with just a linear transformation. Then use cross tabulation on the 14 variables -- maybe.

But we've been considering big data, right?

Errata: "So, if you will, we are predicting random variable Y height from three random variable or three components of the one random variable X -- your choice."

should read

"So, if you will, we are predicting random variable Y height from three random variables or three components of the one random variable X -- your choice."

Nice, you just provided the solution to Homework 1, Problem 3.1 (https://davidrosenberg.github.io/mlcourse/Homework/hw1.pdf).

My solution never mentioned Bayes! So, don't have to "be a Bayesian" to use what I wrote.

I just used conditional expectation which, of course, with full details, is from the Radon-Nikodym theorem in measure theory as in, say, Rudin, Real and Complex Analysis with a nice, novel proof from von Neumann.

Yes, of course. A “Bayes prediction function” has nothing to do with Bayesian. Bayes had a lot of things named after him ;)

Errata: "So, under meager assumptions that essentially always hold in practice f(X) is also a random variable."

should read

"Then f(X) is also a random variable."

You seem to have a preference for an approach in which you assume certain things are true about the world (e.g. y is a linear function of x), and then you derive some optimal prediction function, based on that assumption, under some definition of optimal. And that seems fine. In that example, you'd end up with the same results if you decided to restrict your search for a prediction function to a hypothesis space containing only linear functions -- not because you think the world obeys a linear function, but because you happen to like working with linear functions. I do agree that you can get insight into a method by knowing when it's the optimal method. We do talk about conditional probability models in Lecture 17, where we can assume that distribution of y has a specific form given x (although again we frame it as restricting your search space, rather than as an assumption about the world).

About the "throwing stuff against the wall until something appears to stick." First of all, I don't entirely object to this approach, in the sense that I don't think it's dangerous, so long as you follow the standard procedures of machine learning. And it's where somebody would be if, for example, they went through Lecture 1 on Black Box Machine Learning. But the other 29 lectures are building towards more than that. For example, in Lectures 6 and 7 we get a pretty careful understanding of how L1 and L2 regularization behave in different situations. In Lecture 19, we connect L1 and L2 regularization to various prior beliefs you have about the values of the true coefficients (in a Bayesian framework). We do examine the hypothesis spaces of trees, boosted trees, and neural networks, and consider the tradeoffs (piecewise constant vs smooth, trainable by gradient methods vs not). Yes, there is absolutely plenty of "just try it" in machine learning. Most of the theory of machine learning (generalization bounds, etc.) is about when it's ok to estimate performance on future data based on performance on data you have now. We don't have to believe the world obeys a certain model for this to work, we only have to believe that the world will behave the same way in the future as it does now.
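
One concrete contrast between L1 and L2 shows up already with a single coefficient. The sketch below is a standard textbook special case and is not taken from the course lectures: with an unpenalized estimate b, the L2-penalized objective 0.5*(b - beta)^2 + 0.5*lam*beta^2 is minimized at b/(1 + lam), while the L1-penalized objective 0.5*(b - beta)^2 + lam*|beta| is minimized by soft thresholding. The function names are chosen here for illustration.

```python
def ridge_1d(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + 0.5*lam*beta**2."""
    return b / (1.0 + lam)

def lasso_1d(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + lam*abs(beta) (soft thresholding)."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

for b in (3.0, 0.8, -0.2):
    print(b, ridge_1d(b, 1.0), lasso_1d(b, 1.0))
```

So L2 shrinks every coefficient toward zero proportionally but never exactly to zero, while L1 sets small coefficients exactly to zero. That gives one reason, beyond trial and error, to prefer L1 when only a few inputs are believed to matter.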

It's unfortunate that there wasn't more time in the class for factor analysis, although we do have a thorough treatment of the EM algorithm (Lecture 27), which is what you'd need for that. I used to give a similar argument about 'crosstab' and the curse of dimensionality (https://davidrosenberg.github.io/mlcourse/Archive/2015/Lectu...). What other methods would you have liked to see in the course? To scope it, the course is focused on making good prediction on future data.

> We do talk about conditional probability models in Lecture 17, where we can assume that distribution of y has a specific form given x (although again we frame it as restricting your search space, rather than as an assumption about the world).

With what I wrote, just use cross tabulation justified as the best least squares estimate just from my short derivation based on conditional expectation. Don't have to mention Bayes.

> assume that distribution of y has a specific form given x

With what I wrote, don't have to do that. To be quite precise, do need to assume that random variable Y has an expectation, but in practice that is a meager assumption. Otherwise in what I wrote, essentially don't have to make any assumptions at all about distributions (e.g., do want to assume that Y has an expectation).

Here I have a remark to the statistics community, especially the beginning students and their teachers, a remark old for me: Yes, we have random variables. Yes, each random variable, even the multidimensional ones, has a cumulative distribution. Yes, it's often not too much to assume that we have a probability density. Yes, there are some important probability densities, exponential, Poisson, ..., especially Gaussian. Yes, some of these densities have some amazing properties -- e.g., for the Gaussian, sample mean and variance are sufficient statistics (see a famous 1940s paper by Halmos and Savage and a later paper on lack of stability by E. Dynkin).

So, from that, a suggestion accepted implicitly is that in work in applied statistics early on we should seek to find "the distribution". If the usual suspects don't fit, then we should try some flexible alternatives and try to make them fit.

If you want some spreadsheet program to draw you a histogram, okay. With a good program, you can get a histogram for two dimensional data. If you strain with your eyes and some tricky software, you have an outside shot at three dimensions. For more than three dimensions, you are in Death Valley in July out'a gas and walking -- not quite hopeless yet but not in a good situation.

Sorry, on finding the distribution, really, essentially NOPE -- don't try that, basically can't do that. Blunt reality check: Yes, distributions exist but, no, usually can't find them and, actually, mostly work without knowing them. Maybe all you know is that variance is finite! This situation holds mostly true for real valued random variables. For multivariate valued random variables, e.g., vector valued, as the dimension rises above 2, are getting into trouble; before the dimension gets to be a dozen, are lost at sea; anything above a few dozen are lost in deep outer space without a paddle and hopeless.

> We don't have to believe the world obeys a certain model for this to work, we only have to believe that the world will behave the same way in the future as it does now.

Yes, classic texts on regression and some of what J. Tukey wrote about how regression has commonly been used in practice should have made this point clearer and gone on to be clearer about overfitting, etc.

Also, with this approach, given some positive integer n and some data (y_i, x_i), i = 1, 2, ..., n, maybe if we keep looking, e.g., fit with ln(x_i), sin(x_i), exp(x_i), ..., eventually might get a good fit. Then do we believe that the system behaved this way in the past so with meager assumptions will do so in the future? I confess, this is a small, philosophical point.
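The "keep looking" procedure can be sketched as below. This is only an illustration under assumptions: the data are synthetic (generated from a log curve plus noise), the candidate list is just the transforms named above plus the identity, and the names `r_squared` and `try_transforms` are hypothetical.

```python
import math
import random


def r_squared(xs, ys):
    """R^2 of the least-squares line ys ~ a + b * xs (squared correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# Candidate transforms of x to try, one at a time.
TRANSFORMS = {
    "x": lambda t: t,
    "ln(x)": math.log,
    "sin(x)": math.sin,
    "exp(x)": math.exp,
}


def try_transforms(xs, ys):
    """Fit ys against each transform of xs; return the in-sample R^2 of each."""
    return {name: r_squared([f(x) for x in xs], ys) for name, f in TRANSFORMS.items()}


# Synthetic data, assumed for illustration: y = 2 + 3 ln(x) + small noise.
rng = random.Random(0)
xs = [1.0 + 4.0 * i / 49 for i in range(50)]
ys = [2.0 + 3.0 * math.log(x) + rng.gauss(0.0, 0.1) for x in xs]

scores = try_transforms(xs, ys)
best = max(scores, key=scores.get)

if __name__ == "__main__":
    for name, score in scores.items():
        print(f"{name:7s} R^2 = {score:.3f}")
    print("best in-sample fit:", best)
```

The search picks the winner by in-sample fit alone. Nothing in it says whether the system will behave that way in the future, which is the philosophical point.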

Better points:

(A) Just from the course, for any model building, we have a dozen, maybe more, efforts at TIFO -- try it and find out. That's a bit wasteful of time -- right, automate that, have the computers do all the usual dozen. Hmm.

(B) We are so short on assumptions, we will have to be short on consequences of theorems about what we have or might have or might do. Again, we are like Ptolemy instead of Newton.

Yes, I neglected to mention the famous and appropriate "curse of dimensionality".

> What other methods would you have liked to see in the course?

As I wrote, there's a lot on the shelves of the research libraries. No one course can cover all that might be useful for solving real problems.

So, another approach is first to pick the problem and then pick the solution techniques. Even for some solution techniques that might be, even have been, useful, tough to say that they should be in a course. So, maybe what should be in a course is a start, lots of powerful, general views, that enable, given a particular problem, picking a suitable solution technique. And, even with this broad approach, it might be productive to have some specializations.

I have a gut feeling that some of the computer science community took regression a bit too seriously.

Sure, I really like and respect Leo Breiman and can respect his work on classification and regression trees (do have his book but never bothered to use or get the software), but I like his text Probability published by SIAM a lot more, have it, have studied it, along with Neveu, Chung, Loeve, Skorohod, etc.

So, to suggest a prediction technique not much related to regression? Okay: Blue has some nuclear-powered ballistic missile submarines (SSBNs), i.e., submarines that while submerged can fire intercontinental ballistic missiles, each with several nuclear warheads that can be independently targeted.

These SSBNs hide in the oceans of the world. To help remain hidden, they execute some approximation to Brownian motion. E.g., even if Red gets some data on where a Blue SSBN is, a day or so later Red will have much less good data on where it is.

If Red can find an SSBN, then sinking it is easy enough -- just explode a nuke over it. But if sink one, then that starts a nuclear war, and the remaining SSBNs might shoot. So, Red could try to find and sink all the SSBNs at once -- finding them all, all at the same time, is unlikely, and mounting a search effort that would promise to do that would be provocative.

But, for considering world war scenarios, maybe there would be a nuclear war but limited just to sea. Then could sink SSBNs one at a time as find them. In this scenario, how long might they last?

Getting a good answer is your mission, and you have to accept it! Gee, we could try regression analysis, maybe neural networks?

Or: How to find things at sea was considered back in WWII -- that was an important question then, also. There was a report OEG-56 by B. Koopman. The report argued, with some clever applied math, that with an okay approximation the encounter rates were as in a Poisson process with the encounter rate a known algebraic function, in the report, in terms of the area of the ocean (for real oceans, neglect the details of the shape), speed of the target, speed of the searcher, and detection radius, all IIRC.

So, ..., can argue that for the war at sea, encounters are like various Poisson processes. Altogether the Red-Blue inventory declines like a multi-dimensional, continuous time, discrete state space (huge state space) Markov process subordinated to a Poisson process (for a good start, see the E. Cinlar Introduction to Stochastic Processes).

So, given the speeds, radii, and what happens -- one dies, other dies, both die, neither die -- one on one, Red-Blue encounters, have all the data needed.

Now with that data, we can make some predictions. Sure, there's a closed form solution, but the "curse of dimensionality" gets in the way. But, Monte Carlo and the strong law of large numbers to the rescue: Write software for a few days and run it for a few minutes, run off 1000 sample paths, average, and get the prediction with some decently good confidence intervals.
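For readers who want the shape of such a simulation, here is a minimal sketch. It is not the original software. The assumptions are loud: every Red-Blue pair meets at the same Poisson rate `rate_per_pair` (a plain input here, not the OEG-56 formula), each encounter ends in one of the four outcomes with fixed probabilities, and the quantity estimated is the number of Blue boats left at a time horizon. All names and numbers are hypothetical.

```python
import math
import random


def simulate_path(rng, red, blue, rate_per_pair,
                  p_red_dies, p_blue_dies, p_both_die, horizon):
    """Run one sample path; return the number of Blue boats left at `horizon`."""
    t = 0.0
    while red > 0 and blue > 0:
        # Superposition of independent Poisson processes, one per Red-Blue pair.
        total_rate = rate_per_pair * red * blue
        t += rng.expovariate(total_rate)  # exponential time to next encounter
        if t > horizon:
            break
        u = rng.random()
        if u < p_red_dies:
            red -= 1
        elif u < p_red_dies + p_blue_dies:
            blue -= 1
        elif u < p_red_dies + p_blue_dies + p_both_die:
            red -= 1
            blue -= 1
        # Otherwise neither dies and the inventories are unchanged.
    return blue


def estimate_blue_survivors(n_paths=1000, seed=0, **params):
    """Average over sample paths; return (mean, approximate 95% interval)."""
    rng = random.Random(seed)
    samples = [simulate_path(rng, **params) for _ in range(n_paths)]
    mean = sum(samples) / n_paths
    var = sum((s - mean) ** 2 for s in samples) / (n_paths - 1)
    half_width = 1.96 * math.sqrt(var / n_paths)  # normal approximation
    return mean, (mean - half_width, mean + half_width)


if __name__ == "__main__":
    # Hypothetical inputs: 20 Red searchers, 10 Blue boats, 90 time units.
    mean, interval = estimate_blue_survivors(
        red=20, blue=10, rate_per_pair=0.002,
        p_red_dies=0.2, p_blue_dies=0.5, p_both_die=0.1, horizon=90.0,
    )
    print(mean, interval)
```

The strong law of large numbers is what justifies the averaging. The interval uses the usual normal approximation from the central limit theorem, so treat it as approximate.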

I did that. Later the work was sold to a leading intelligence agency. I could tell you which one, but then I'd have to ...!

"Look, Ma, no regression analysis!".

I've got some more, but this post is getting long!

Lesson: There's a lot more to applied statistics than empirical curve fitting.

Hehe, ok -- I also love Breiman’s Probability book. It’s really a standout on ergodic theory. And Breiman et al.’s book on Trees is surprisingly rich, talking about all sorts of stuff besides trees.
As much as I respect Breiman, I think he latched on too hard to his pet theory that all that ensembles do is reduce variance, and by doing so missed out on what boosting does.

Yeah, random forests work really well but they are layers and layers of hacks, rules of thumb and intuition piled one on top of the other. I can't claim with a straight face that any of them follows from solid principles.

Graycat and I have a history of discussing the differences between stats and ML here on HN. I just added a comment upstream in the thread.

I imagine Breiman was just talking about bagging-style parallel ensembles, when he was talking about variance reduction, not boosting-style sequential ensembles. Not long before he died, he was still actively trying to figure out why AdaBoost “works”. Don’t think he claimed to really understand that. He had experimental results that disputed the “it’s just maximizing the margin” explanation.

Saw the comments above — are you from a stats or ML background, or neither?

I am more ML than stats. BTW Breiman believed the same for boosting. Later he got a little unsure. You will find this in his writings on boosting.

For ergodic theory, there were a few really good lectures, and I have good notes, in a course I took from a star student of E. Cinlar.

For Breiman Probability:

(1) I liked his start, nice, simple, intuitive, on how the law of large numbers works. Of course the super nice proof is from the martingale convergence theorem, but Breiman's start is nice.

(2) His book has the nicest start on diffusions I have seen.

(3) Once in some of my work I ran into a martingale. So, then, when I wanted to review martingale theory, I went to Breiman. Nice.

(4) Since I was interested in stochastic optimal control, I was interested in measurable selection and regular conditional distributions, and I got those from Breiman -- thanks Leo!!!! Yes, IIRC, there is the relevant Sierpinski exercise in Halmos, Measure Theory and also in Loeve, Probability, but it was Breiman who gave me what I really needed.

(5) Generally Breiman is easy to read. He writes like he is trying to help the reader learn.