Hacker News new | ask | show | jobs
by clircle 2904 days ago
Nice comment, echoes a lot of my feelings about ML. I have a question. You write

> (2) Okay, when simple regression doesn't fit very well, we keep trying? Okay. Say, we try logistic regression, ridge regression, L1 or L2 regularization regression, regression trees, boosting, ..., neural networks, etc. Uh, which is better, L1 regularization or L2 regularization? Uh, this sounds like throwing stuff against the wall until something appears to stick. Sure, that can work at times, but are we really satisfied with that? Wouldn't we want some more solid reasons for the tool we pick? Lots of places elsewhere in applied math, applied probability, and applied statistics we do have solid reasons.

I'm wondering what these "solid reasons" could be? Some sort of experience based on past data? An example would be helpful.

3 comments

Let's see: Pick a system that we know to be linear. Get a lot of pairs of real inputs and outputs and find the coefficients of the linear function that relates the inputs to the outputs. Then given a new input, can say what the output would be.

Okay, the function, system between a violin on the stage at Carnegie Hall and a seat near the roof is linear. So, I'm typing quickly here, do regression with a recording at the stage and at the seat and look for the coefficients in the convolution. Then given an oboe, can say what it will sound like in the seat.

Hooks law with small deflections is linear. So, take independent variables the forces on a space frame and the dependent variables the deflections and estimate all the spring stiffness values.

Take 200 recipes for tomato sauce all from the same 10 ingredients, for each recipe measure the weight of each ingredient and the weight of the protein in the final sauce and estimate the protein in each of the 10 ingredients. Then for any new tomato sauce recipe, weigh the ingredients and get the protein in the final sauce.

> I'm wondering what these "solid reasons" could be?

For the classic assumptions for regression, we know that a linear function exists, our data has additive errors that have mean zero, constant variance, and are Gaussian.

Then we can get an F ratio test on the regression, t-tests on the coefficients, and confidence intervals on the predicted values. This is just the standard stuff in the classic derivations in several of the books I listed.

And with some weaker assumptions, there are still some such results we can get.

This is just for regression.

There's a lot more in applied statistics, that is, where we make some assumptions that seem accurate enough in practice, get some theorems, and benefit from the conclusions of the theorems.

For an example, also in this tread I wrote about estimating how long some submarines would last. The assumptions were explicit and, on a nice day, somewhat credible. If swallow the assumptions, then have to take the conclusions seriously.

I've done other problems in applied statistics -- have a problem, collect some data, see what assumptions can make, from the assumptions, have some theorems, from the theorems have some conclusions powerful for the problem. If swallow the assumptions, trust the software, ..., then have to take the conclusions seriously. So, get to check the software and argue about the assumptions.

For machine learning as in the Bloomberg course, have some training data and some test, validation data. Fit to the training data. Check with the test data. Assume that the real world situation doesn't change, apply the model, and smile on the way to the bank.

I guess, okay when it works. But:

(A) how many valuable real applications are there? I.e., regression has been around with solid software for decades, and I've yet to see the yachts of the regression practitioners -- maybe there are some now but what is new might mostly be hype. Or the problem was that the SAS founders were not very good at sales? Same for SPSS (now pushed by IBM), Mathlab, R, etc.?

(B) Wouldn't we also want confidence intervals on the predictions? Okay, maybe there are some resampling/bootstrap ways to get those, maybe.

(C) There's more to applied statistics than empirical curve fitting, and commonly there we can have "solid reasons". For more on applied statistics, what I've done is only a drop in the bucket, ocean, but there research libraries are awash in examples.

Graycat and I have a history of conversations on HN regarding the differences between statistics and machine learning approaches.

ML is a term that came from Russian probabilists and statisticians. ML is applied probability, optimization and algorithms. A traditional statistician will be strong in the former but not necessarily in the other two.

One can of course take the 'what sticks to the wall approach' but it would be a mischaracterization to state that a principled approach does not exist in ML literature. There are theorems stating when to try one over the other and then what to expect.

The difference in stats and ML can be seen in the nature of the theorems in their body of literature.

Lets take the probabilistic setting of ML, because this matches with the assumption a sophisticated statistician would make. ML has techniques where you don't need a stochastic model, those settings are similar to adversarial game theoretic fundamentals that does not involve a stochastic sampling process.

The assumptions:

We have a joint distribution over random variables X,Y and have drawn a sample of size n from it. We have another sample from the same joint distribution but the Ys have been removed and we want to get those back.

The baby statistician's way:

God came to me in my dream and told me every thing about a function f(X, \theta) that would recover the ys. He conveyed infinite amount of information but left out a finite number of parameters (settings of knobs) for me to figure out. I was told that the specific \theta lives in the set \Theta.

The mature statistician's way:

Almost the same as above except that \theta would be living in an infinite dimensional space. In other words he can work with a less powerful god who leaves out infinite amount of information.

The theorems: A statistician would be interested in theorems where they show that their method manages to recover the \theta that god left out.

In other words they would like to find an estimator \hat{\theta} that for an infinitely sized initial sample converges to the true \theta. The crowning accomplishments are when these theorems prove that this happens with probability 1. Weaker ones are those that state that this happens with as high a probability one wants (but never quite reaching 1).

ML way:

A machine learner is not interested in theorems that characterize how close \hat{\theta} is to true \theta. Furthermore asymptotically we are all dead, so he is interested in finite sample results. His point would be that all this god mumbo-jumbo and \theta is a piece of fiction we invented anyway. No one will ever be able see these, even in principle, so he couldn't care less about those. He would care about theorems that show how the function value (and not its parameters) converges uniformly to the best function in a class of functions \cal F_n of his choosing and using finitely many samples. He would also be interested in finding a sequence of function classes so that one can consider richer and richer supersets of function classes as he get more and more data. The regularization business falls out of the function classes he wants to look at. A closely related statistical viewpoint would be those of sieves.

This is not an entirely correct characterization because there existed (and exists) probabilist and statisticians who were interested in prediction theorems rather than parameter recovery theorems, but they were far far fewer in number. An example would be Prof. Dawid.