| Part II So, let's move on to cross tabulation: Let real number x be some value of random
variable X. Suppose we take the cumulative
distribution of Y when X = x (X is a
random variable that might take on many
different values; x is just some real
number). E.g., maybe when X = x = 3 (X is
a random variable; x is just the real
number 3) we find the conditional
probability that Y <= y given that X = x,
that is, we find P(Y <= y|X = x) =
F_{Y|X}(y) (Y is a random variable; y is
just some real number; Y|X is a
subscript). Or we imagine a thick chocolate cake with
one edge the Y axis and another edge the X
axis. The surface of the cake is the
joint probability density of random
variables X and Y. At X = x, we cut the
cake parallel to the Y axis and
perpendicular to the X axis. We look at
the cut surface -- that's essentially the
conditional probability density of Y
given X = x. Note: In an advanced course in
probability, we can show that it makes
sense to consider the density at
individual values of x one at a time --
the key to doing so is the subject
measure theory used as the foundation
for probability theory as in the 1933
paper of A. Kolmogorov. Well, with that cut surface, we take its
area and divide by it so that we have
scaled the cut area to have area 1 so that
the cut surface is a probability density
-- we are being really intuitive here but
essentially correct. We now have the conditional probability
density of random variable Y given random
variable X when X = x for real number x. Or, suppose Y is height and X is weight.
With X = 110, we get the probability
density distribution of height for people
who weigh 110. So, we are on the way to using weight to
predict height -- no joke, this is close
to fully seriously the most accurate way
to proceed in the context. If X is not just a number but two numbers,
weight and gender, say, 0 for female and 1
for male, then, with weight 110 and gender
female, we get the distribution of height
for 110 pound females. If X has three
components, with the third 1 for
cheerleaders and 0 otherwise, then we can
predict height given weight of female
cheerleaders. So, we are into predicting
one random variable, height, from three
components, weight, gender, and
cheerleader or not, of random variable X.
Sure, the component weight is also a
random variable, and similarly for gender
and cheerleader or not. So, if you will,
we are predicting random variable Y height
from three random variable or three
components of the one random variable X --
your choice. Since we can get a conditional
distribution, we can take the expectation
of the distribution and get a conditional
expectation. So, the conditional
expectation of Y given that X = x is
written E[Y|X = x]. This is correct, but
with details we could spend a hour here so
let's just move along. Well, the conditional expectation E[Y|X = x] is a function of the real number x. We can
also regard it as more simply a function
of the random variable X and write this as
E[Y|X], the conditional expectation of
random variable Y given (the values of)
random variable X. So, for some function f
with domain the values of X, we have f(X)
= E[Y|X]. So, under meager assumptions
that essentially always hold in practice
f(X) is also a random variable. It turns out, nicely enough, that the
expectation of E[Y|X] is the expectation
of Y itself; E[E[Y|X]] = E[f(X)] = E[Y] We want to show that f(X) = E[Y|X] is the
least squares (minimum squared error)
approximation of Y possible from X. So
f(X) = E[Y|X] is the best predictor of Y,
better than anything we could hope to do
with regression, neural networks, anything
linear, anything nonlinear, essentially
anything at all. For the proof, we start with something
simple: We show that a = E[Y] minimizes
E[(Y - a)^2]. That is, we are finding the
single number a that best approximates
real random variable Y in the least
squares sense. Well, we have E[(Y - a)^2] = E[Y^2 - 2 Ya + a^2] = E[Y^2] - 2aE[Y] + a^2 = E[Y^2] + E[Y]^2 - 2aE[Y] + a^2 - E[Y]^2 = E[Y^2] + (E[Y] - a)^2 - E[Y]^2 which we minimize with E[Y] = a. Or, for one interpretation, the minimum
rotational moment of inertia is for
rotation about the center of mass. So, for our main concern, suppose we want
to use the data we have X to approximate
Y. So, we want real valued function g with
domain the possible values of X so that
g(X) approximates Y. For the most accurate approximation, we
want to minimize E[(Y - g(X))^2] Claim: For g(X) we want g(X) = E[Y|X] So, g is the best non-linear least squares
approximation to Y. Proof: We start by using one of the properties of
conditional expectation and then continue
with just simple algebra much as we did
for E[(Y - a)^2] just above: E[(Y - g(X))^2] = E[ E[Y^2 - 2Yg(X) + g(X)^2|X] ] = E[ E[Y^2|X] - 2g(X)E[Y|X] + g(X)^2 ] = E[ E[Y^2|X] E[Y|X]^2 - 2g(X)E[Y|X] + g(X)^2 - E[Y|X]^2 ] = E[ E[Y^2|X] + (E[Y|X] - g(X))^2 - E[Y|X]^2 ] which is minimized with g(X) = E[Y|X] Done. And how do we use this? We just discretize X and use cross
tabulation which directly estimates
E[Y|X]. To be more clear, given real
number x, to predict Y when X = x, we use
cross tabulation to estimate E[Y|X = x] =
f(x). The good news is what we have proven: We
have made the best prediction in the least
squares sense of Y possible from our data
on X. The bad news is that the amount of data we
need grows exponentially in the number of
different components in X. I.e., we
considered three variables in X, weight,
gender, and cheerleader. The amount of
data we need grows as the number of
discrete values of X which grows
exponentially in the number of components
in X. If somehow we KNOW that Y is a linear
function of the components of X, then we
might consider using regression. If we
are doing just empirical curve fitting and
have enough data, which in practice likely
means relatively few components in X, then
we can consider using just cross
tabulation. But if X has thousands of
components, then the exponential explosion
in the amount of data required on X likely
means that we will have to set aside cross
tabulation, look for a way to use, maybe,
a few dozen components of X, or go ahead
and use the machine learning material in
the Bloomberg course. Uh, if are trying to predict people, there
is an old hint -- people are no more than
about 14 dimensional beings. That is,
given 1000 variables on each of 10 million
people, using principle components we
should be able to reproduce the 1000
variables quite accurately using just 14
principle components obtained from the
1000 with just a linear transformation.
Then use cross tabulation on the 14
variables -- maybe. But we've been considering big data,
right? |
should read
"So, if you will, we are predicting random variable Y height from three random variables or three components of the one random variable X -- your choice."