Nice comment: "the now classic texts in regression analysis start with something the machine learning curve fitting does not...". What classic texts would you recommend starting with?
Here is a relatively good introductory
text on statistics. It starts without
either calculus or linear algebra and
makes nice progress into experimental
design and analysis of variance, that is,
multivariate statistics with discrete
data. It is from some serious experts in
that field, at Iowa State University, who
had some serious reasons, e.g., how to
maximize corn yields considering soil
chemistry, water, seed variety, plowing
techniques, fertilizer, etc. It has long
been commonly used for undergraduates in
the social sciences, e.g., educational
statistics, and in agriculture. It is
(with my mark-up for TeX):
George W.\ Snedecor and William G.\
Cochran, {\it Statistical Methods, Sixth
Edition.\/}
A good, first, no toy, text on regression,
with minimal prerequisites:
N.\ R.\ Draper and H.\ Smith, {\it Applied
Regression Analysis.\/}
Some good books on regression and its
usual generalizations, e.g., IIRC, factor
analysis, principal components,
discriminant analysis, etc.:
Maurice M.\ Tatsuoka, {\it Multivariate
Analysis: Techniques for Educational and
Psychological Research.\/}
Donald F.\ Morrison, {\it Multivariate
Statistical Methods: Second Edition.\/}
William W.\ Cooley and Paul R.\ Lohnes,
{\it Multivariate Data Analysis.\/}
A mathematically relatively serious text
on regression:
C.\ Radhakrishna Rao, {\it Linear
Statistical Inference and Its
Applications:\ \ Second Edition.\/}
For multivariate statistics with discrete
data, consider
George W.\ Snedecor and William G.\
Cochran, {\it Statistical Methods, Sixth
Edition.\/}
Stephen E.\ Fienberg, {\it The Analysis of
Cross-Classified Data.\/}
Yvonne M.\ M.\ Bishop, Stephen E.\
Fienberg, Paul W.\ Holland, {\it Discrete
Multivariate Analysis:\ \ Theory and
Practice.\/}
Shelby J.\ Haberman, {\it Analysis of
Qualitative Data, Volume 1, Introductory
Topics.\/}
Shelby J.\ Haberman, {\it Analysis of
Qualitative Data, Volume 2, New
Developments.\/}
The classic on the math of analysis of
variance:
Henry Scheff\'e, {\it Analysis of
Variance.\/}
Broadly, in all of this, we are trying to
analyze data on, call it, several
variables, make predictions, etc.
In all of this, and as in the title
{\it The Analysis of Cross-Classified
Data.\/}
above, suppose we have in mind random
variables Y and X.
What is a random variable? Go outside.
Measure something. Call that the value of
random variable Y. What you measured was
one value of possibly many that you might
have measured. Considering all those
possible values, there is a cumulative
distribution F_Y that, for any real
number y, gives the probability that
random variable Y is <= y:
P(Y <= y) = F_Y(y).
So, F_Y(y) is defined for all real numbers
y, goes to 0 in the limit as y goes to
minus infinity, and goes to 1 in the limit
as y goes to plus infinity. As we move
real number y from left to right, F_Y(y)
never decreases, that is, F_Y is monotone
non-decreasing. On a nice day, function
F_Y is differentiable, and then the
derivative from calculus
f_Y(y) = d/dy F_Y(y)
is the probability density of real
random variable Y.
Here's the standard way to discover
something about Y, in particular about its
cumulative distribution F_Y:
We can imagine having random variables
Y_1, Y_2, ... that are, in the sense of
probability, independent and that
have the same cumulative distribution as
Y. Then for real number y, by the law of
large numbers (the weak version has an
easy proof), as positive integer n grows
large, the fraction of the values
Y_1, Y_2, ..., Y_n
that are <= y
approximates F_Y(y) as accurately as we
please. So, via such simple random
sampling, for any real number y we can
estimate F_Y(y), the cumulative
distribution of Y.
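As a minimal sketch of that sampling idea, here is the fraction of samples <= y computed in Python. The function name empirical_cdf and the choice of Y uniform on [0, 1) (so that F_Y(0.3) = 0.3) are just assumptions for illustration, not anything from the texts above.

```python
import random


def empirical_cdf(samples, y):
    """Fraction of the samples that are <= y, an estimate of F_Y(y)."""
    return sum(1 for s in samples if s <= y) / len(samples)


random.seed(0)  # fixed seed so that a run is repeatable

# Assumption for illustration: Y is uniform on [0, 1), so F_Y(0.3) = 0.3.
for n in (100, 10_000, 100_000):
    samples = [random.random() for _ in range(n)]
    print(n, empirical_cdf(samples, 0.3))
```

As n grows, the printed fractions should settle near 0.3.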
For a little more, under meager
assumptions that hold nearly universally
in practice, if we take the ordinary grade
school average of
Y_1, Y_2, ..., Y_n
as n increases we will approximate the
average or expected value of Y denoted
by E[Y].
To define the expected value, we can use
some calculus and the cumulative
distribution F_Y, but for now let's
just use our intuition about averages and
move along.
Now suppose we are also given random
variable X. Maybe the values of X are just
real numbers, some 10 real numbers, 20
values from the set {1, 2, 3}, the NASDAQ
price of Microsoft 100 times a second for
the last three weeks, or full details on
the atmosphere of earth every microsecond
for the past 5 billion years. That is, for
the values of X we can accept a lot of
generality. Still more generality is
possible, but that would take us on a
detour for a while.
For our point here, let's assume that X
takes on just discrete values or we have
just rounded off the values and forced
them to be discrete. In practice we will
have only finitely many discrete values.
Now we want to use X to predict Y.
So, sure, much of machine learning is to
construct a model, maybe with regression
trees or neural networks, to make this
prediction, but here we will show a
simpler way that, in the least squares
sense, is the most accurate possible
whenever we have enough data. How 'bout
that!
This simpler way is just old cross
tabulation.
Net, over a wide range of real cases of
trying to predict Y from X, we should just
use cross tabulation unless we don't have
enough data. Or, the main reason for just
empirical curve fitting using linear
regression models or continuous neural
network models is that we don't have
enough data just to use cross tabulation.
For a preview of a coming attraction, we
will notice in nearly all of regression
and neural networks big concerns about
overfitting. Well, with enough data in
each cell, cross tabulation doesn't have
that problem. How 'bout that!
Errata: "We can imagine having random variables Y_1, Y_2, ... that are, in the sense of probability, independent of Y and that have the same cumulative distribution as Y."
should read
"We can imagine having random variables Y_1, Y_2, ... that are, in the sense of probability, independent and that have the same cumulative distribution as Y."
"To define the expected value, we can use some calculus and the cumulative distribution of F_Y, but for now let's just use our intuition about averages and move along."
should read
"To define the expected value, we can use some calculus and the cumulative distribution F_Y, but for now let's just use our intuition about averages and move along."
Let real number x be some value of random
variable X. Suppose we take the cumulative
distribution of Y when X = x (X is a
random variable that might take on many
different values; x is just some real
number). E.g., maybe when X = x = 3 (X is
a random variable; x is just the real
number 3) we find the conditional
probability that Y <= y given that X = x,
that is, we find P(Y <= y|X = x) =
F_{Y|X}(y) (Y is a random variable; y is
just some real number; Y|X is a
subscript).
Or we imagine a thick chocolate cake with
one edge the Y axis and another edge the X
axis. The surface of the cake is the
joint probability density of random
variables X and Y. At X = x, we cut the
cake parallel to the Y axis and
perpendicular to the X axis. We look at
the cut surface -- that's essentially the
conditional probability density of Y
given X = x.
Note: In an advanced course in
probability, we can show that it makes
sense to consider the density at
individual values of x one at a time --
the key to doing so is the subject
measure theory used as the foundation
for probability theory as in the 1933
paper of A. Kolmogorov.
Well, with that cut surface, we take its
area and divide the heights along the cut
by that area, so that the scaled cut has
area 1 and the cut surface is a
probability density -- we are being really
intuitive here but essentially correct.
We now have the conditional probability
density of random variable Y given random
variable X when X = x for real number x.
Or, suppose Y is height and X is weight.
With X = 110, we get the probability
density of height for people who weigh
110.
So, we are on the way to using weight to
predict height -- no joke, this is, close
to fully seriously, the most accurate way
to proceed in the context.
If X is not just a number but two numbers,
weight and gender, say, 0 for female and 1
for male, then, with weight 110 and gender
female, we get the distribution of height
for 110 pound females. If X has three
components, with the third 1 for
cheerleaders and 0 otherwise, then we can
predict height given weight of female
cheerleaders. So, we are into predicting
one random variable, height, from three
components, weight, gender, and
cheerleader or not, of random variable X.
Sure, the component weight is also a
random variable, and similarly for gender
and cheerleader or not. So, if you will,
we are predicting random variable Y height
from three random variables or three
components of the one random variable X --
your choice.
Since we can get a conditional
distribution, we can take the expectation
of the distribution and get a conditional
expectation. So, the conditional
expectation of Y given that X = x is
written E[Y|X = x]. This is correct, but
with details we could spend an hour here,
so let's just move along.
Well, the conditional expectation
E[Y|X = x]
is a function of the real number x. We can
also regard it as more simply a function
of the random variable X and write this as
E[Y|X], the conditional expectation of
random variable Y given (the values of)
random variable X. So, for some function f
with domain the values of X, we have f(X)
= E[Y|X]. So, under meager assumptions
that essentially always hold in practice
f(X) is also a random variable.
It turns out, nicely enough, that the
expectation of E[Y|X] is the expectation
of Y itself:
E[E[Y|X]] = E[f(X)] = E[Y]
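For a small check of E[E[Y|X]] = E[Y], here is a sketch in Python with exact fractions. The joint distribution in the table joint is made up purely for illustration, and the helper names marginal_x and cond_exp are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint distribution P(X = x, Y = y), made up for illustration.
joint = {
    (0, 1): Fraction(1, 10),
    (0, 3): Fraction(3, 10),
    (1, 2): Fraction(2, 10),
    (1, 6): Fraction(4, 10),
}


def marginal_x(joint):
    """P(X = x) for each x, by summing the joint probabilities over y."""
    p = {}
    for (x, _), pr in joint.items():
        p[x] = p.get(x, 0) + pr
    return p


def cond_exp(joint, x):
    """E[Y | X = x] = sum over y of y * P(X = x, Y = y) / P(X = x)."""
    px = marginal_x(joint)[x]
    return sum(y * pr for (xx, y), pr in joint.items() if xx == x) / px


p_x = marginal_x(joint)
e_y = sum(y * pr for (_, y), pr in joint.items())                # E[Y]
e_of_cond = sum(cond_exp(joint, x) * p for x, p in p_x.items())  # E[E[Y|X]]
print(e_y, e_of_cond)  # prints: 19/5 19/5
```

Here E[Y|X = 0] = 5/2 and E[Y|X = 1] = 14/3, and weighting those by P(X = 0) = 2/5 and P(X = 1) = 3/5 gives back E[Y] = 19/5.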
We want to show that f(X) = E[Y|X] is the
least squares (minimum squared error)
approximation of Y possible from X. So
f(X) = E[Y|X] is the best predictor of Y,
better than anything we could hope to do
with regression, neural networks, anything
linear, anything nonlinear, essentially
anything at all.
For the proof, we start with something
simple: We show that a = E[Y] minimizes
E[(Y - a)^2]. That is, we are finding the
single number a that best approximates
real random variable Y in the least
squares sense. Well, we have
E[(Y - a)^2]
= E[Y^2 - 2 Ya + a^2]
= E[Y^2] - 2aE[Y] + a^2
= E[Y^2] + E[Y]^2 - 2aE[Y] + a^2 - E[Y]^2
= E[Y^2] + (E[Y] - a)^2 - E[Y]^2
which we minimize with a = E[Y], since
only the term (E[Y] - a)^2 depends on a
and that term is never negative.
Or, for one interpretation, the minimum
rotational moment of inertia is for
rotation about the center of mass.
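To see the algebra above on numbers, here is a sketch in Python where E is just the ordinary average over four made-up values, each taken as equally likely. The names mse, values, and candidates are assumptions for illustration.

```python
def mse(values, a):
    """Average of (v - a)^2 over the values, i.e., E[(Y - a)^2] when each
    value is taken as equally likely."""
    return sum((v - a) ** 2 for v in values) / len(values)


values = [2.0, 3.0, 5.0, 10.0]    # made-up data for illustration
mean = sum(values) / len(values)  # 5.0, playing the role of E[Y]

# Try candidate values of a on a grid around the mean, in steps of 0.1.
candidates = [mean + d / 10 for d in range(-30, 31)]
best = min(candidates, key=lambda a: mse(values, a))

print(best)              # 5.0, the mean
print(mse(values, mean))  # 9.5
print(mse(values, 7.0))   # 13.5 = 9.5 + (5.0 - 7.0)^2
```

The last line is the identity from the derivation: the error at any a is the error at the mean plus (E[Y] - a)^2.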
So, for our main concern, suppose we want
to use the data we have, X, to approximate
Y. So, we want real valued function g with
domain the possible values of X so that
g(X) approximates Y.
For the most accurate approximation, we
want to minimize
E[(Y - g(X))^2]
Claim: For g(X) we want
g(X) = E[Y|X]
So, g is the best non-linear least squares
approximation to Y.
Proof:
We start by using one of the properties of
conditional expectation and then continue
with just simple algebra much as we did
for E[(Y - a)^2] just above:
E[(Y - g(X))^2]
= E[ E[Y^2 - 2Yg(X) + g(X)^2|X] ]
= E[ E[Y^2|X] - 2g(X)E[Y|X]
+ g(X)^2 ]
= E[ E[Y^2|X] + E[Y|X]^2 - 2g(X)E[Y|X]
+ g(X)^2 - E[Y|X]^2 ]
= E[ E[Y^2|X]
+ (E[Y|X] - g(X))^2
- E[Y|X]^2 ]
which is minimized with
g(X) = E[Y|X]
Done.
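For anyone who wants the two properties of conditional expectation used in the proof spelled out, here is a sketch in LaTeX (assuming the amsmath package for align*). Z stands for any random variable with an expectation, a symbol introduced here just for the statement of the first property.

```latex
\begin{align*}
E\big[(Y - g(X))^2\big]
  &= E\Big[\, E\big[(Y - g(X))^2 \,\big|\, X\big] \Big]
  && \text{since } E[Z] = E\big[E[Z \mid X]\big] \\
E\big[Y g(X) \,\big|\, X\big]
  &= g(X)\, E[Y \mid X],
  \qquad
  E\big[g(X)^2 \,\big|\, X\big] = g(X)^2
  && \text{since } g(X) \text{ is known given } X \\
E\big[(Y - g(X))^2\big]
  &= E\big[\, E[Y^2 \mid X] - E[Y \mid X]^2 \,\big]
   + E\big[\, (E[Y \mid X] - g(X))^2 \,\big]
\end{align*}
```

In the last line the first term does not depend on g at all, and the second term is never negative and is 0 when g(X) = E[Y|X], which is the claim.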
And how do we use this?
We just discretize X and use cross
tabulation which directly estimates
E[Y|X]. To be more clear, given real
number x, to predict Y when X = x, we use
cross tabulation to estimate E[Y|X = x] =
f(x).
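A minimal sketch of that cross tabulation in Python, standard library only. The data rows, the 10 pound bin width, and the names cross_tab_means and weight_bin are all made up for illustration; the point is only that the estimate of E[Y|X = x] is the plain average of Y within the cell for x.

```python
from collections import defaultdict


def cross_tab_means(rows):
    """Map each observed cell x to the average of Y over the rows with
    X = x. Each row is a pair (x, y) with x hashable, e.g., a tuple."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for x, y in rows:
        totals[x] += y
        counts[x] += 1
    return {x: totals[x] / counts[x] for x in totals}


def weight_bin(pounds, width=10):
    """Discretize weight by rounding down to a multiple of width."""
    return int(pounds // width) * width


# Made-up observations: (weight in pounds, gender, height in inches).
data = [
    (112, "F", 62.0), (118, "F", 64.0), (115, "F", 63.0),
    (171, "M", 70.0), (178, "M", 72.0),
    (174, "F", 67.0),
]

# X is the pair (weight bin, gender); Y is height.
rows = [((weight_bin(w), g), h) for w, g, h in data]
table = cross_tab_means(rows)

print(table[(110, "F")])       # 63.0, the estimate of E[Y | X = (110, F)]
print(table.get((130, "F")))   # None: no data in that cell, no estimate
```

The None for the empty cell is the "not enough data" case: cross tabulation has nothing to say about a value of X we never observed.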
The good news is what we have proven: We
have made the best prediction in the least
squares sense of Y possible from our data
on X.
The bad news is that the amount of data we
need grows exponentially in the number of
different components in X. I.e., we
considered three variables in X, weight,
gender, and cheerleader. The amount of
data we need grows as the number of
discrete values of X which grows
exponentially in the number of components
in X.
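To put rough numbers on that growth, here is a small sketch in Python. The level counts (20 weight bins, 2 genders, 2 for cheerleader or not, and 10 levels for each of 14 components) are assumptions picked for illustration, and cell_count is a hypothetical helper name.

```python
from math import prod


def cell_count(levels_per_component):
    """Number of cells in the cross tabulation: the product of the number
    of discrete levels of each component of X."""
    return prod(levels_per_component)


# Assumed levels: 20 weight bins, 2 genders, cheerleader or not.
print(cell_count([20, 2, 2]))   # 80 cells

# Assumed 14 components with 10 levels each.
print(cell_count([10] * 14))    # 100000000000000 cells
```

With only a handful of observations wanted per cell, 80 cells is easy and 10^14 cells is far beyond 10 million people.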
If somehow we KNOW that Y is a linear
function of the components of X, then we
might consider using regression. If we
are doing just empirical curve fitting and
have enough data, which in practice likely
means relatively few components in X, then
we can consider using just cross
tabulation. But if X has thousands of
components, then the exponential explosion
in the amount of data required on X likely
means that we will have to set aside cross
tabulation, look for a way to use, maybe,
a few dozen components of X, or go ahead
and use the machine learning material in
the Bloomberg course.
Uh, if we are trying to predict people,
there is an old hint -- people are no more
than about 14 dimensional beings. That is,
given 1000 variables on each of 10 million
people, using principal components we
should be able to reproduce the 1000
variables quite accurately using just 14
principal components obtained from the
1000 with just a linear transformation.
Then use cross tabulation on the 14
variables -- maybe.
Errata: "So, if you will, we are predicting random variable Y height from three random variable or three components of the one random variable X -- your choice."
should read
"So, if you will, we are predicting random variable Y height from three random variables or three components of the one random variable X -- your choice."
My solution never mentioned Bayes! So, don't have to "be a Bayesian" to use what I wrote.
I just used conditional expectation which, of course, with full details, is from the Radon-Nikodym theorem in measure theory as in, say, Rudin, Real and Complex Analysis with a nice, novel proof from von Neumann.