| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by potbelly83 2904 days ago
	Nice comment, " the now classic texts in regression analysis start with something the machine learning curve fitting does not...". What classic texts would you recommend starting with?

1 comments

graycat 2904 days ago

Part I

A relatively good, introductory text on statistics, starting without either calculus or linear algebra, with nice progress into experimental design and analysis of variance, that is, multivariate statistics with discrete data, from some serious experts in that field, from University of Iowa, for some serious reasons, how to maximize corn yields considering soil chemistry, water, seed variety, plowing techniques, fertilizer, etc., long commonly used for undergraduates in the social sciences, e.g., educational statistics, agriculture, etc. is (with my mark-up for TeX):

George W.\ Snedecor and William G.\ Cochran, {\it Statistical Methods, Sixth Edition.\/}

A good, first, no toy, text on regression, with minimal prerequisites:

N.\ R.\ Draper and H.\ Smith, {\it Applied Regression Analysis.\/}

Some good books on regression and its usual generalizations, e.g., IIRC, factor analysis, i.e., principle components, discriminate analysis, etc.:

Maurice M.\ Tatsuoka, {\it Multivariate Analysis: Techniques for Educational and Psychological Research.\/}

Donald F.\ Morrison, {\it Multivariate Statistical Methods: Second Edition.\/}

William W.\ Cooley and Paul R.\ Lohnes, {\it Multivariate Data Analysis.\/}

A mathematically relatively serious text on regression:

C.\ Radhakrishna Rao, {\it Linear Statistical Inference and Its Applications:\ \ Second Edition.\/}

For multivariate statistics with discrete data, consider

George W.\ Snedecor and William G.\ Cochran, {\it Statistical Methods, Sixth Edition.\/}

Stephen E.\ Fienberg, {\it The Analysis of Cross-Classified Data.\/}

Yvonne M.\ M.\ Bishop, Stephen E.\ Fienberg, Paul W.\ Holland, {\it Discrete Multivariate Analysis:\ \ Theory and Practice.\/}

Shelby J.\ Haberman, {\it Analysis of Qualitative Data, Volume 1, Introductory Topics.\/}

Shelby J.\ Haberman, {\it Analysis of Qualitative Data, Volume 2, New Developments.\/}

The classic on the math of analysis of variance:

Henry Scheff\'e, {\it Analysis of Variance.\/}

Broadly, in all of this, we are trying to analyze data on, call it, several variables, make predictions, etc.

In all of this, and as in the title

{\it The Analysis of Cross-Classified Data.\/}

above, suppose we have in mind random variables Y and X.

What is a random variable? Go outside. Measure something. Call that the value of random variable Y. What you measured was one value of possibly many that you might have measured. Considering all those possible values, there is a cumulative distribution F_Y(y) that for any real number y we have the probability that random variable Y is <= real number y

P(Y <= y) = F_Y(y).

So, F_Y(y) is defined for all real numbers y, is at 0 at the limit of y at minus infinity and at 1 at the limit of y at plus infinity. So, as we move real number y from left to right, F_Y(y) increases -- monotonically. On a nice day, function F_Y is differentiable, and with the derivative from calculus

f_Y(y) = d/dy F_Y(y)

and is the probability density of real random variable Y.

Here's the standard way to discover something about Y, in particular about its cumulative distribution F_Y:

We can imagine having random variables Y_1, Y_2, ... that are, in the sense of probability, independent of Y and that have the same cumulative distribution as Y. Then for positive integer n, and for real number y, by the law of large numbers (the weak version has an easy proof), in the limit as n grows large, as accurately as we please, the fraction of the values

Y_1, Y_2, ...,Y_n

that are <= y

is F_Y(y). So, via such simple random sampling, for any real number y we can estimate F_Y(y), the cumulative distribution of Y.

For a little more, under meager assumptions that hold nearly universally in practice, if we take the ordinary grade school average of

Y_1, Y_2, ..., Y_n

as n increases we will approximate the average or expected value of Y denoted by E[Y].

To define the expected value, we can use some calculus and the cumulative distribution of F_Y, but for now let's just use our intuition about averages and move along.

Now suppose we are also given random variable X. Maybe the values of X are just real numbers, some 10 real numbers, 20 values, from set {1, 2, 3}, the last three weeks 100 times a second of the NYSE price of Microsoft, or full details on the atmosphere of earth every microsecond for the past 5 billion years. That is, for the values of X we can accept a lot of generality. Still more generality is possible, but that would take us on a detour for a while.

For our point here, let's assume that X takes on just discrete values or we have just rounded off the values and forced them to be discrete. In practice we will have only finitely many discrete values.

Now we want to use X to predict Y.

So, sure, much of machine learning is to construct a model, maybe with regression trees or neural networks, to make this prediction, but here we will show a simpler way that is always the most accurate possible whenever we have enough data. How 'bout that!

This simpler way is just old cross tabulation.

Net, over a wide range of real cases trying to predict Y from X, we should just use cross tabulation unless we don't have enough data. Or, the main reason for just empirical curve fitting using regression linear models or neural network continuous models is that we don't have enough data just to use cross tabulation.

For a preview of a coming attraction, will notice in nearly all of regression and neural networks big concerns about over fitting. Well, cross tabulation doesn't have that problem. How 'bout that!

link

graycat 2904 days ago

Errata: "We can imagine having random variables Y_1, Y_2, ... that are, in the sense of probability, independent of Y and that have the same cumulative distribution as Y."

should read

"We can imagine having random variables Y_1, Y_2, ... that are, in the sense of probability, independent and that have the same cumulative distribution as Y."

"To define the expected value, we can use some calculus and the cumulative distribution of F_Y, but for now let's just use our intuition about averages and move along."

should read

"To define the expected value, we can use some calculus and the cumulative distribution F_Y, but for now let's just use our intuition about averages and move along."

link

graycat 2904 days ago

Part II

So, let's move on to cross tabulation:

Let real number x be some value of random variable X. Suppose we take the cumulative distribution of Y when X = x (X is a random variable that might take on many different values; x is just some real number). E.g., maybe when X = x = 3 (X is a random variable; x is just the real number 3) we find the conditional probability that Y <= y given that X = x, that is, we find P(Y <= y|X = x) = F_{Y|X}(y) (Y is a random variable; y is just some real number; Y|X is a subscript).

Or we imagine a thick chocolate cake with one edge the Y axis and another edge the X axis. The surface of the cake is the joint probability density of random variables X and Y. At X = x, we cut the cake parallel to the Y axis and perpendicular to the X axis. We look at the cut surface -- that's essentially the conditional probability density of Y given X = x.

Note: In an advanced course in probability, we can show that it makes sense to consider the density at individual values of x one at a time -- the key to doing so is the subject measure theory used as the foundation for probability theory as in the 1933 paper of A. Kolmogorov.

Well, with that cut surface, we take its area and divide by it so that we have scaled the cut area to have area 1 so that the cut surface is a probability density -- we are being really intuitive here but essentially correct.

We now have the conditional probability density of random variable Y given random variable X when X = x for real number x.

Or, suppose Y is height and X is weight. With X = 110, we get the probability density distribution of height for people who weigh 110.

So, we are on the way to using weight to predict height -- no joke, this is close to fully seriously the most accurate way to proceed in the context.

If X is not just a number but two numbers, weight and gender, say, 0 for female and 1 for male, then, with weight 110 and gender female, we get the distribution of height for 110 pound females. If X has three components, with the third 1 for cheerleaders and 0 otherwise, then we can predict height given weight of female cheerleaders. So, we are into predicting one random variable, height, from three components, weight, gender, and cheerleader or not, of random variable X. Sure, the component weight is also a random variable, and similarly for gender and cheerleader or not. So, if you will, we are predicting random variable Y height from three random variable or three components of the one random variable X -- your choice.

Since we can get a conditional distribution, we can take the expectation of the distribution and get a conditional expectation. So, the conditional expectation of Y given that X = x is written E[Y|X = x]. This is correct, but with details we could spend a hour here so let's just move along.

Well, the conditional expectation

E[Y|X = x]

is a function of the real number x. We can also regard it as more simply a function of the random variable X and write this as E[Y|X], the conditional expectation of random variable Y given (the values of) random variable X. So, for some function f with domain the values of X, we have f(X) = E[Y|X]. So, under meager assumptions that essentially always hold in practice f(X) is also a random variable.

It turns out, nicely enough, that the expectation of E[Y|X] is the expectation of Y itself;

E[E[Y|X]] = E[f(X)] = E[Y]

We want to show that f(X) = E[Y|X] is the least squares (minimum squared error) approximation of Y possible from X. So f(X) = E[Y|X] is the best predictor of Y, better than anything we could hope to do with regression, neural networks, anything linear, anything nonlinear, essentially anything at all.

For the proof, we start with something simple: We show that a = E[Y] minimizes E[(Y - a)^2]. That is, we are finding the single number a that best approximates real random variable Y in the least squares sense. Well, we have

E[(Y - a)^2]

= E[Y^2 - 2 Ya + a^2]

= E[Y^2] - 2aE[Y] + a^2

= E[Y^2] + E[Y]^2 - 2aE[Y] + a^2 - E[Y]^2

= E[Y^2] + (E[Y] - a)^2 - E[Y]^2

which we minimize with E[Y] = a.

Or, for one interpretation, the minimum rotational moment of inertia is for rotation about the center of mass.

So, for our main concern, suppose we want to use the data we have X to approximate Y. So, we want real valued function g with domain the possible values of X so that g(X) approximates Y.

For the most accurate approximation, we want to minimize

E[(Y - g(X))^2]

Claim: For g(X) we want

g(X) = E[Y|X]

So, g is the best non-linear least squares approximation to Y.

Proof:

We start by using one of the properties of conditional expectation and then continue with just simple algebra much as we did for E[(Y - a)^2] just above:

E[(Y - g(X))^2]

= E[ E[Y^2 - 2Yg(X) + g(X)^2|X] ]

= E[ E[Y^2|X] - 2g(X)E[Y|X]

+ g(X)^2 ]

= E[ E[Y^2|X] E[Y|X]^2 - 2g(X)E[Y|X]

+ g(X)^2 - E[Y|X]^2 ]

= E[ E[Y^2|X]

+ (E[Y|X] - g(X))^2

- E[Y|X]^2 ]

which is minimized with

g(X) = E[Y|X]

Done.

And how do we use this?

We just discretize X and use cross tabulation which directly estimates E[Y|X]. To be more clear, given real number x, to predict Y when X = x, we use cross tabulation to estimate E[Y|X = x] = f(x).

The good news is what we have proven: We have made the best prediction in the least squares sense of Y possible from our data on X.

The bad news is that the amount of data we need grows exponentially in the number of different components in X. I.e., we considered three variables in X, weight, gender, and cheerleader. The amount of data we need grows as the number of discrete values of X which grows exponentially in the number of components in X.

If somehow we KNOW that Y is a linear function of the components of X, then we might consider using regression. If we are doing just empirical curve fitting and have enough data, which in practice likely means relatively few components in X, then we can consider using just cross tabulation. But if X has thousands of components, then the exponential explosion in the amount of data required on X likely means that we will have to set aside cross tabulation, look for a way to use, maybe, a few dozen components of X, or go ahead and use the machine learning material in the Bloomberg course.

Uh, if are trying to predict people, there is an old hint -- people are no more than about 14 dimensional beings. That is, given 1000 variables on each of 10 million people, using principle components we should be able to reproduce the 1000 variables quite accurately using just 14 principle components obtained from the 1000 with just a linear transformation. Then use cross tabulation on the 14 variables -- maybe.

But we've been considering big data, right?

link

graycat 2904 days ago

Errata: "So, if you will, we are predicting random variable Y height from three random variable or three components of the one random variable X -- your choice."

should read

"So, if you will, we are predicting random variable Y height from three random variables or three components of the one random variable X -- your choice."

link

davidsrosenberg 2904 days ago

Nice you just provided the solution to Homework 1, Problem 3.1 (https://davidrosenberg.github.io/mlcourse/Homework/hw1.pdf).

link

graycat 2903 days ago

My solution never mentioned Bayes! So, don't have to "be a Bayesian" to use what I wrote.

I just used conditional expectation which, of course, with full details, is from the Radon-Nikodym theorem in measure theory as in, say, Rudin, Real and Complex Analysis with a nice, novel proof from von Neumann.

link

davidsrosenberg 2903 days ago

Yes, of course. A “Bayes prediction function” has nothing to do with Bayesian. Bayes had a lot of things named after him ;)

link

graycat 2903 days ago

Errata: "So, under meager assumptions that essentially always hold in practice f(X) is also a random variable."

should read

"Then f(X) is also a random variable."

link