Hacker News new | ask | show | jobs
by graycat 2904 days ago
Part II

So, let's move on to cross tabulation:

Let real number x be some value of random variable X. Suppose we take the cumulative distribution of Y when X = x (X is a random variable that might take on many different values; x is just some real number). E.g., maybe when X = x = 3 (X is a random variable; x is just the real number 3) we find the conditional probability that Y <= y given that X = x, that is, we find P(Y <= y|X = x) = F_{Y|X}(y) (Y is a random variable; y is just some real number; Y|X is a subscript).

Or we imagine a thick chocolate cake with one edge the Y axis and another edge the X axis. The surface of the cake is the joint probability density of random variables X and Y. At X = x, we cut the cake parallel to the Y axis and perpendicular to the X axis. We look at the cut surface -- that's essentially the conditional probability density of Y given X = x.

Note: In an advanced course in probability, we can show that it makes sense to consider the density at individual values of x one at a time -- the key to doing so is the subject measure theory used as the foundation for probability theory as in the 1933 paper of A. Kolmogorov.

Well, with that cut surface, we take its area and divide by it so that we have scaled the cut area to have area 1 so that the cut surface is a probability density -- we are being really intuitive here but essentially correct.

We now have the conditional probability density of random variable Y given random variable X when X = x for real number x.

Or, suppose Y is height and X is weight. With X = 110, we get the probability density distribution of height for people who weigh 110.

So, we are on the way to using weight to predict height -- no joke, this is close to fully seriously the most accurate way to proceed in the context.

If X is not just a number but two numbers, weight and gender, say, 0 for female and 1 for male, then, with weight 110 and gender female, we get the distribution of height for 110 pound females. If X has three components, with the third 1 for cheerleaders and 0 otherwise, then we can predict height given weight of female cheerleaders. So, we are into predicting one random variable, height, from three components, weight, gender, and cheerleader or not, of random variable X. Sure, the component weight is also a random variable, and similarly for gender and cheerleader or not. So, if you will, we are predicting random variable Y height from three random variable or three components of the one random variable X -- your choice.

Since we can get a conditional distribution, we can take the expectation of the distribution and get a conditional expectation. So, the conditional expectation of Y given that X = x is written E[Y|X = x]. This is correct, but with details we could spend a hour here so let's just move along.

Well, the conditional expectation

E[Y|X = x]

is a function of the real number x. We can also regard it as more simply a function of the random variable X and write this as E[Y|X], the conditional expectation of random variable Y given (the values of) random variable X. So, for some function f with domain the values of X, we have f(X) = E[Y|X]. So, under meager assumptions that essentially always hold in practice f(X) is also a random variable.

It turns out, nicely enough, that the expectation of E[Y|X] is the expectation of Y itself;

E[E[Y|X]] = E[f(X)] = E[Y]

We want to show that f(X) = E[Y|X] is the least squares (minimum squared error) approximation of Y possible from X. So f(X) = E[Y|X] is the best predictor of Y, better than anything we could hope to do with regression, neural networks, anything linear, anything nonlinear, essentially anything at all.

For the proof, we start with something simple: We show that a = E[Y] minimizes E[(Y - a)^2]. That is, we are finding the single number a that best approximates real random variable Y in the least squares sense. Well, we have

E[(Y - a)^2]

= E[Y^2 - 2 Ya + a^2]

= E[Y^2] - 2aE[Y] + a^2

= E[Y^2] + E[Y]^2 - 2aE[Y] + a^2 - E[Y]^2

= E[Y^2] + (E[Y] - a)^2 - E[Y]^2

which we minimize with E[Y] = a.

Or, for one interpretation, the minimum rotational moment of inertia is for rotation about the center of mass.

So, for our main concern, suppose we want to use the data we have X to approximate Y. So, we want real valued function g with domain the possible values of X so that g(X) approximates Y.

For the most accurate approximation, we want to minimize

E[(Y - g(X))^2]

Claim: For g(X) we want

g(X) = E[Y|X]

So, g is the best non-linear least squares approximation to Y.

Proof:

We start by using one of the properties of conditional expectation and then continue with just simple algebra much as we did for E[(Y - a)^2] just above:

E[(Y - g(X))^2]

= E[ E[Y^2 - 2Yg(X) + g(X)^2|X] ]

= E[ E[Y^2|X] - 2g(X)E[Y|X]

+ g(X)^2 ]

= E[ E[Y^2|X] E[Y|X]^2 - 2g(X)E[Y|X]

+ g(X)^2 - E[Y|X]^2 ]

= E[ E[Y^2|X]

+ (E[Y|X] - g(X))^2

- E[Y|X]^2 ]

which is minimized with

g(X) = E[Y|X]

Done.

And how do we use this?

We just discretize X and use cross tabulation which directly estimates E[Y|X]. To be more clear, given real number x, to predict Y when X = x, we use cross tabulation to estimate E[Y|X = x] = f(x).

The good news is what we have proven: We have made the best prediction in the least squares sense of Y possible from our data on X.

The bad news is that the amount of data we need grows exponentially in the number of different components in X. I.e., we considered three variables in X, weight, gender, and cheerleader. The amount of data we need grows as the number of discrete values of X which grows exponentially in the number of components in X.

If somehow we KNOW that Y is a linear function of the components of X, then we might consider using regression. If we are doing just empirical curve fitting and have enough data, which in practice likely means relatively few components in X, then we can consider using just cross tabulation. But if X has thousands of components, then the exponential explosion in the amount of data required on X likely means that we will have to set aside cross tabulation, look for a way to use, maybe, a few dozen components of X, or go ahead and use the machine learning material in the Bloomberg course.

Uh, if are trying to predict people, there is an old hint -- people are no more than about 14 dimensional beings. That is, given 1000 variables on each of 10 million people, using principle components we should be able to reproduce the 1000 variables quite accurately using just 14 principle components obtained from the 1000 with just a linear transformation. Then use cross tabulation on the 14 variables -- maybe.

But we've been considering big data, right?

3 comments

Errata: "So, if you will, we are predicting random variable Y height from three random variable or three components of the one random variable X -- your choice."

should read

"So, if you will, we are predicting random variable Y height from three random variables or three components of the one random variable X -- your choice."

Nice you just provided the solution to Homework 1, Problem 3.1 (https://davidrosenberg.github.io/mlcourse/Homework/hw1.pdf).
My solution never mentioned Bayes! So, don't have to "be a Bayesian" to use what I wrote.

I just used conditional expectation which, of course, with full details, is from the Radon-Nikodym theorem in measure theory as in, say, Rudin, Real and Complex Analysis with a nice, novel proof from von Neumann.

Yes, of course. A “Bayes prediction function” has nothing to do with Bayesian. Bayes had a lot of things named after him ;)
Errata: "So, under meager assumptions that essentially always hold in practice f(X) is also a random variable."

should read

"Then f(X) is also a random variable."