| Part I

A relatively good, introductory text on
statistics, starting without either
calculus or linear algebra, with nice
progress into experimental design and
analysis of variance, that is,
multivariate statistics with discrete
data, from some serious experts in that
field, from Iowa State University, for some
serious reasons, how to maximize corn
yields considering soil chemistry, water,
seed variety, plowing techniques,
fertilizer, etc., long commonly used for
undergraduates in the social sciences,
e.g., educational statistics, agriculture,
etc. is (with my mark-up for TeX):

George W.\ Snedecor and William G.\
Cochran, {\it Statistical Methods, Sixth
Edition.\/}

A good, first, no toy, text on regression,
with minimal prerequisites:

N.\ R.\ Draper and H.\ Smith, {\it Applied
Regression Analysis.\/}

Some good books on regression and its
usual generalizations, e.g., IIRC, factor
analysis, i.e., principal components,
discriminant analysis, etc.:

Maurice M.\ Tatsuoka, {\it Multivariate
Analysis: Techniques for Educational and
Psychological Research.\/}

Donald F.\ Morrison, {\it Multivariate
Statistical Methods: Second Edition.\/}

William W.\ Cooley and Paul R.\ Lohnes,
{\it Multivariate Data Analysis.\/}

A mathematically relatively serious text
on regression:

C.\ Radhakrishna Rao, {\it Linear
Statistical Inference and Its
Applications:\ \ Second Edition.\/}

For multivariate statistics with discrete
data, consider

George W.\ Snedecor and William G.\
Cochran, {\it Statistical Methods, Sixth
Edition.\/}

Stephen E.\ Fienberg, {\it The Analysis of
Cross-Classified Data.\/}

Yvonne M.\ M.\ Bishop, Stephen E.\
Fienberg, Paul W.\ Holland, {\it Discrete
Multivariate Analysis:\ \ Theory and
Practice.\/}

Shelby J.\ Haberman, {\it Analysis of
Qualitative Data, Volume 1, Introductory
Topics.\/}

Shelby J.\ Haberman, {\it Analysis of
Qualitative Data, Volume 2, New
Developments.\/}

The classic on the math of analysis of
variance:

Henry Scheff\'e, {\it Analysis of
Variance.\/}

Broadly, in all of this, we are trying to
analyze data on, call it, several
variables, make predictions, etc.

In all of this, and as in the title
{\it The Analysis of Cross-Classified
Data.\/} above, suppose we have in mind
random variables Y and X.

What is a random variable? Go outside.
Measure something. Call that the value of
random variable Y. What you measured was
one value of possibly many that you might
have measured. Considering all those
possible values, there is a cumulative
distribution F_Y such that for any real
number y the probability that random
variable Y is <= y is

P(Y <= y) = F_Y(y).

So, F_Y(y) is defined for all real numbers
y, is at 0 at the limit of y at minus
infinity and at 1 at the limit of y at
plus infinity. So, as we move real number
y from left to right, F_Y(y) never
decreases, that is, F_Y is monotone.
On a nice day, function F_Y is
differentiable, and the derivative from
calculus

f_Y(y) = d/dy F_Y(y)

is the probability density of real
random variable Y.

Here's the standard way to discover
something about Y, in particular about its
cumulative distribution F_Y:

We can imagine having random variables
Y_1, Y_2, ... that are, in the sense of
probability, independent and that
have the same cumulative distribution as
Y. Then for positive integer n, and for
real number y, by the law of large numbers
(the weak version has an easy proof), in
the limit as n grows large, as accurately
as we please, the fraction of the values
Y_1, Y_2, ..., Y_n that are <= y is
F_Y(y). So, via such simple random
sampling, for any real number y we can
estimate F_Y(y), the cumulative
distribution of Y.

For a little more, under meager
assumptions that hold nearly universally
in practice, if we take the ordinary grade
school average of Y_1, Y_2, ..., Y_n, then
as n increases we will approximate the
average or expected value of Y denoted
by E[Y].

To define the expected value, we can use
some calculus and the cumulative
distribution F_Y, but for now let's
just use our intuition about averages and
move along.

Now suppose we are also given random
variable X. Maybe the values of X are just
real numbers, some 10 real numbers, 20
values from the set {1, 2, 3}, the NASDAQ
price of Microsoft 100 times a second for
the last three weeks, or full details on
the atmosphere of earth every microsecond
for the past 5 billion years. That is, for
the values of X we can accept a lot of
generality. Still more generality is
possible, but that would take us on a
detour for a while.

For our point here, let's assume that X
takes on just discrete values or we have
just rounded off the values and forced
them to be discrete. In practice we will
have only finitely many discrete values.

Now we want to use X to predict Y.

So, sure, much of machine learning is to
construct a model, maybe with regression
trees or neural networks, to make this
prediction, but here we will show a
simpler way that is always the most
accurate possible whenever we have enough
data. How 'bout that!

This simpler way is just old cross
tabulation.

Net, over a wide range of real cases
trying to predict Y from X, we should just
use cross tabulation unless we don't have
enough data. Or, the main reason for just
empirical curve fitting using regression
linear models or neural network continuous
models is that we don't have enough data
just to use cross tabulation.

For a preview of a coming attraction, we
will notice in nearly all of regression and
neural networks big concerns about
overfitting. Well, with enough data, cross
tabulation doesn't have that problem.
How 'bout that! |
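
For a rough feel of what cross tabulation as a predictor looks like, here is a minimal sketch in Python. It assumes "most accurate" is meant in the least squares sense, so the prediction for each discrete value of X is just the average of the Y values seen with that X. The function names fit_cross_tab and predict and the data are made up for illustration; none of it comes from the books above.

```python
import random
from collections import defaultdict


def fit_cross_tab(xs, ys):
    """Return {x: average of the Y values observed together with that x}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(xs, ys):
        totals[x] += y
        counts[x] += 1
    return {x: totals[x] / counts[x] for x in counts}


def predict(table, x):
    """Predicted Y for x, or None when the cell for x has no data."""
    return table.get(x)


# Made-up data (an assumption for this sketch): X is drawn from {1, 2, 3}
# and Y is X squared plus Gaussian noise, so a straight line would fit badly.
rng = random.Random(1)
xs = [rng.choice([1, 2, 3]) for _ in range(30_000)]
ys = [x * x + rng.gauss(0.0, 1.0) for x in xs]

table = fit_cross_tab(xs, ys)
for x in sorted(table):
    print(x, round(predict(table, x), 2))  # cell averages near 1, 4, 9
print(predict(table, 7))                   # None: no data in that cell
```

Any hashable value works as x, so a tuple such as (soil, seed, fertilizer) gives a cell per combination of several discrete variables. The None for an empty cell is the "don't have enough data" case: the number of cells grows fast with the number of variables, and each cell needs its own data.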
The intended wording of two sentences above:

"We can imagine having random variables Y_1, Y_2, ... that are, in the sense of probability, independent and that have the same cumulative distribution as Y."

"To define the expected value, we can use some calculus and the cumulative distribution F_Y, but for now let's just use our intuition about averages and move along."
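
The simple random sampling estimates described earlier, the fraction of Y_1, ..., Y_n that are <= y for F_Y(y) and the grade school average for E[Y], can be sketched in a few lines of Python. This is only an illustration under an assumption of mine: Y is taken to be uniform on [0, 1), so that F_Y(y) = y there and E[Y] = 0.5 and we can see how close the estimates land. The names empirical_cdf and sample_mean are hypothetical.

```python
import random


def empirical_cdf(samples, y):
    """Fraction of the sample values that are <= y."""
    return sum(1 for v in samples if v <= y) / len(samples)


def sample_mean(samples):
    """Ordinary grade school average of the sample values."""
    return sum(samples) / len(samples)


# Assumption for this sketch: Y is uniform on [0, 1), so the true values
# are F_Y(0.3) = 0.3 and E[Y] = 0.5.
rng = random.Random(0)
ys = [rng.random() for _ in range(100_000)]

est_cdf = empirical_cdf(ys, 0.3)
est_mean = sample_mean(ys)
print(est_cdf, est_mean)  # both should land close to 0.3 and 0.5
```

With smaller n the two estimates wander more, which is the law of large numbers seen from the other side.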