| How I learned: Early in my career, I was around DC doing mostly work in applied math and computing for US national security. No joke -- the work was constantly heavy on probability, statistics, and stochastic processes. I had a good undergrad math major but no courses in any of those three subjects. So I was thrown into the deep end of the pool and was constantly struggling to understand. I did pick up a good overview and a lot of intuition. But the sources varied widely, in both the topics and the quality, over stacks of books and papers, documentation of software, etc. Lesson: At least at first, one way to learn is just to jump in at the deep end and struggle using lots of texts, references, etc. Sad Lesson: While nearly all the famous books were good, some of the books that, e.g., from the publisher, might have seemed good were not. The guy who wrote the stuff, I hope he got tenure -- there can't be any other reason. Later I got an applied math Ph.D. and had a terrific course in analysis and probability. The analysis part was basically Royden, Real Analysis, and the first half of Rudin, Real and Complex Analysis. There was also some material from Oxtoby, Measure and Category (ice cream and cake dessert -- super fun stuff). The probability started right at the beginning with sigma algebras, etc. A central topic was the Radon-Nikodym theorem and conditional expectation -- gorgeous once you see it. There was beautiful coverage of the classic limit theorems, especially martingales. Best course of any kind I ever took in school. The prof was a star student of E. Cinlar, long at Princeton. For the course, the main texts in probability were from J. Neveu, L. Breiman, K. Chung, M. Loeve. For statistics, for the applied stuff, I just remember the stacks of books I worked with early on, especially multivariate statistics. For the math, I just regard that as applied probability and sometimes just do my own derivations, sometimes at least a little new. 
I never found a statistics book I like or can recommend as the single, main book, e.g., like Rudin in analysis or Neveu in probability. All I can suggest is just to dig into the stacks of the most famous books and also glance at some of the software documentation. I suspect that there is a really good statistics book to be written, and maybe someone has written it, or is writing it, but I haven't seen it. Here is a simple derivation I typed in yesterday with an intuitive result in statistics that maybe people should keep in mind. In a sense this little derivation shows what the strongest possible result in statistical estimation is, and may I have the envelope please [drum roll]: the discrete data version of the winner is just cross tabulation, assuming that we have enough data. The context is a person applying for
credit. Might proceed similarly for, say,
ad targeting, etc. We assume that Y is a real valued random
variable where E[Y^2], that is, the expectation of Y^2, is finite --
a meager assumption, especially in
practice. The Y is something about credit
worthiness, e.g., loss on a
loan, that we are interested in. We assume that X is a random variable
taking possibly very general values, e.g.,
a credit history at uncountably infinitely
many points in time in the past. We
assume that we have the value of X --
that's our credit data on the person. Let's do a little preliminary derivation:
What value of real number a minimizes E[(Y - a)^2]? Well, we have
E[(Y - a)^2]
= E[Y^2 - 2Ya + a^2]
= E[Y^2] - 2aE[Y] + a^2
= E[Y^2] + E[Y]^2 - 2aE[Y] + a^2 - E[Y]^2
= E[Y^2] + (E[Y] - a)^2 - E[Y]^2
which we minimize with a = E[Y]. Or, for one interpretation, the minimum
rotational moment of inertia is for
rotation about the center of mass. So, for our main concern, suppose we want
to use the data we have, X, to approximate
Y. So, we want a real valued function f with
domain the possible values of X so that
f(X) approximates Y. For the most accurate approximation, we
want to minimize E[(Y - f(X))^2].
Claim: For f(X) we want f(X) = E[Y|X].
So, this f(X), using X, is the best non-linear least squares
approximation to Y. Proof: We start by using one of the properties of
conditional expectation and then continue
with just simple algebra:
E[(Y - f(X))^2]
= E[ E[Y^2 - 2Yf(X) + f(X)^2|X] ]
= E[ E[Y^2|X] - 2f(X)E[Y|X] + f(X)^2 ]
= E[ E[Y^2|X] + E[Y|X]^2 - 2f(X)E[Y|X] + f(X)^2 - E[Y|X]^2 ]
= E[ E[Y^2|X] + (E[Y|X] - f(X))^2 - E[Y|X]^2 ]
which is minimized with f(X) = E[Y|X], since only the squared term depends on f and it is never negative. Done. |
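As a rough illustration of the discrete case, here is a minimal Python sketch; it is not part of the derivation, just a numerical check of it. With X taking a few discrete values, the cross tabulation estimate of E[Y|X] is the average of Y within each cell of X. The credit grades, mean losses, noise level, and the helper names cross_tab_means and mean_squared_error are all made up for the example.

```python
import random
from collections import defaultdict


def cross_tab_means(xs, ys):
    """Estimate E[Y|X=x] for discrete X by averaging Y within each cell of X."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(xs, ys):
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}


def mean_squared_error(predict, xs, ys):
    """Sample version of E[(Y - f(X))^2] for a predictor f."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data: X is a credit grade, Y is the loss on a loan.
rng = random.Random(0)
true_mean_loss = {"A": 1.0, "B": 3.0, "C": 8.0}  # assumed values, not real data
xs = [rng.choice("ABC") for _ in range(5000)]
ys = [true_mean_loss[x] + rng.gauss(0.0, 2.0) for x in xs]

cell_means = cross_tab_means(xs, ys)   # cross tabulation estimate of E[Y|X]
overall_mean = sum(ys) / len(ys)       # best constant a, as in the preliminary step

# Compare the cell means against two other functions of X.
mse_cross_tab = mean_squared_error(lambda x: cell_means[x], xs, ys)
mse_constant = mean_squared_error(lambda x: overall_mean, xs, ys)
mse_shifted = mean_squared_error(lambda x: cell_means[x] + 0.5, xs, ys)

print("cell means:", cell_means)
print("MSE, cross tabulation:", mse_cross_tab)
print("MSE, overall mean only:", mse_constant)
print("MSE, cell means shifted by 0.5:", mse_shifted)
```

On the sample itself, the cell means give the smallest mean squared error among all functions of X, and shifting every cell by 0.5 adds exactly 0.5^2 = 0.25 to it, which is the (E[Y|X] - f(X))^2 term from the proof. With too few observations per cell the cell means get noisy, which is the "enough data" caveat.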