by naftaliharris 3356 days ago
A less popular but perhaps more influential phenomenon is Stein's Paradox [1]. Here's a provocative example often given to illustrate it: Say you have a baseball player, a soccer player, and a football player, and you wish to estimate the true mean number of home runs, goals, and touchdowns each scores per year. If you have the last ten seasons' worth of data for each, then the obvious thing to do is to estimate each player's true yearly mean score by their average yearly score over those ten years. (E.g., the baseball player hits an average of 20 home runs each year, so let's estimate their true mean yearly home runs by 20). Stein's Paradox says that you can actually do better than this, as measured by the total squared error across all three estimates.

Even more crazy, the James-Stein Estimator which does this actually uses data about the football player and soccer player to make predictions about the baseball player, (and vice-versa). This is deeply unintuitive to most people since the players aren't related to each other at all. The phenomenon only holds with at least three players; it doesn't work for two.

(More generally, Stein's Paradox is the fact that if you have p >= 3 independent Gaussians with a known variance, you can do better in estimating their p-dimensional mean than just using their sample means).
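
A quick way to check the claim numerically is a small simulation. The sketch below assumes unit variance, one noisy observation per mean, and the plain James-Stein estimator that shrinks toward zero; the function name `compare_risks` and the example means are made up for illustration.

```python
import random


def compare_risks(theta, trials=20000, seed=0):
    """Monte Carlo estimate of the total squared error of the raw
    observations (MLE) and of James-Stein, assuming unit variance."""
    rng = random.Random(seed)
    p = len(theta)
    if p < 3:
        raise ValueError("James-Stein needs at least 3 means")
    mle_total = 0.0
    js_total = 0.0
    for _ in range(trials):
        # One noisy observation per true mean, variance 1.
        x = [rng.gauss(t, 1.0) for t in theta]
        # Shrinkage coefficient 1 - (p - 2) * sigma^2 / ||x||^2, with sigma^2 = 1.
        norm_sq = sum(v * v for v in x)
        shrink = 1.0 - (p - 2) / norm_sq
        js = [shrink * v for v in x]
        mle_total += sum((v - t) ** 2 for v, t in zip(x, theta))
        js_total += sum((v - t) ** 2 for v, t in zip(js, theta))
    return mle_total / trials, js_total / trials


if __name__ == "__main__":
    # Hypothetical true means for three unrelated quantities.
    mle, js = compare_risks([1.0, -0.5, 0.5])
    print(f"total squared error, raw observations: {mle:.3f}")
    print(f"total squared error, James-Stein:      {js:.3f}")
```

The gap between the two numbers shrinks as the true means move away from the point you shrink toward (zero here), so the choice of origin matters in practice.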

I've spent a bunch of time trying to understand why this actually works [2]; to be honest I still don't deeply understand. But nonetheless the consensus is that the same shrinkage phenomenon is what causes improved performance for a variety of high-dimensional estimators, (lasso or ridge regression, e.g.), making the paradox very very influential.

[1] https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator [2] https://www.naftaliharris.com/blog/steinviz/

3 comments

Not sure I understand. Why should the number of home runs per year, the number of goals per year, and the number of touchdowns per year have equal variance?
You probably wouldn't expect them to. But the same kind of Stein phenomenon holds under a much broader set of conditions, including arbitrary covariance matrices and arbitrary quadratic loss, (see e.g. [1]). It's a very general phenomenon!

[1] https://projecteuclid.org/euclid.aos/1176345691

I'm going to go out on a limb with my intuition, and hypothesize that the underlying premise of the method zeroes in on a common characteristic of games humans tend to enjoy.

A universe of people wouldn't find these games generally interesting, if they didn't present outcomes above a certain threshold of unexpectedness. The underlying rules of each game are tuned into the equipment used, and a balance is reached, where game play is fair, but still requires players to develop skills.

Because each sport adheres to the premise of capturing interest in players and spectators, they all present the same scoring tendencies, when aggregating and generalizing.

If you change the motive of the activity (mix games with non-games), and the artifact that represents success (mix freely tallied points with rare physical tokens or discovered evidence), so that the behaviors being compared are dissimilar, the predictions will become unreliable.

For example, when comparing "victories" across lawyers, geological prospectors, and sports players you probably would not be able to make predictions about all, by lumping each area's statistics together. A gold-mining prospector probably wouldn't encounter success in the same way a trial lawyer would, and neither would help you predict or generalize a hockey game.

But, an oil driller, a diamond prospector, and a gold prospector would likely compare, based on the geological goal sought. A forensic analyst, a private detective, and a trial lawyer might compare, also, based on the human factors of investigation. And, thus, so too, with sports where freely tallied points measure a player's skill at achieving an event in game play.

I don't understand. If the average of the last 10 seasons is 20 home runs, what would be a better predicted value? You are a bit short in explaining here?

Your site, and the Wiki link, are very math-formula heavy. Is there an explanation for someone who forgot all his statistics courses and Greek letter thingies?

This is maybe a better explanation:

https://jmanton.wordpress.com/2010/06/05/comments-on-james-s...

It's still math heavy, but there is some explanation. It's hard to explain without the math, since the math is fairly integral to it; that's why it's such an amazing discovery. My understanding is that it's saying that the variables are independent, but the measurement is not. So in the case of the athletes, it's not that home runs predict touchdowns or goals, but that by using a Stein estimator we would get a more accurate measure of all three in aggregate. The example used in the article is less interesting, but probably better for understanding:

For example, if i=1,...3 represents the financial cost of claims a multi-national insurance company will incur in the next year in three different countries, the company may be less concerned with estimating the values of the individual means accurately and more concerned with getting an accurate overall estimate.

> If the average of the last 10 seasons is 20 home runs, what would be a better predicted value?

You are correct, 20 is the best estimate for this single variable (or similarly for any single variable in isolation).

Only if the objective is to minimize the total MSE (Mean Square Error),

          (Ph - h)² + (Pg - g)² + (Pt - t)²
    MSE = ---------------------------------
                          3
then it pays off to bias each estimate – Ph, Pg, Pt – slightly towards zero. If an observed value is larger than the true value, we improve the estimate by multiplying it by a correction coefficient slightly under 1. If the observation happens to be smaller than the true value, the correction makes the estimate worse. But the same coefficient moves a small observation by less than it moves a large one, so the damage done when the observation was too small is smaller than the gain when it was too large. A set of 3 independent variables is already large enough that this gamble pays off on average (in the combined total error of the 3 estimates).
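
For reference, the "correction coefficient" has a closed form in the standard James-Stein estimator. The sketch below assumes each observation x_h, x_g, x_t (my notation, not the parent's) is Gaussian around its true value with a known common variance σ², and that we shrink toward zero with p = 3:

```latex
\[
P_i = \left(1 - \frac{(p-2)\,\sigma^2}{x_h^2 + x_g^2 + x_t^2}\right) x_i,
\qquad i \in \{h, g, t\}, \quad p = 3 .
\]
```

The coefficient gets closer to 1 as the observations grow large relative to σ, so the correction fades when the data sit far from the point you shrink toward.
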
Here's my intuition. Let's say you have 1000 coin flippers. They flip a coin 10 times, and none of them has any special powers, and the coin is fair. Some of them will get an equal number of heads and tails, but there's a good chance you'll get some who get 9 or 10 heads, and also some who get 9 or 10 tails. As the probability of getting 10 heads in a row is 1/1024, if you see one or two guys who get only heads, or only tails, you will attribute that to the natural variability of the outcomes.

Now imagine that these are not coin flippers, but some guys who have some skill at something, but the outcome has a large variability nonetheless. For example, running backs in the NFL. There are running backs (RB) who average 2 yards per carry (ypc), and others who average 5. 5 ypc is stellar by the way, 4 is very good, 3 is decent, and 1 or 2 not so much. But obviously, RBs get a different yardage for each carry. Now, let's say you follow the first 4 games of the season and get the average ypc for each RB. You would like to predict for each RB the average ypc for the rest of the year. The classical statistical estimation is that the current average is the best estimator for the future average, but from the extreme example with the coin flippers above, we know that this is not quite the case. Using a Bayesian estimation, we get a better estimator by moving the current average towards the overall mean. This is called a shrinkage or James-Stein estimator. In the case of the coin flippers, you move the average all the way to 1/2, and that estimator is correct. In the case of the running backs, you don't shrink that much, and it's a cute exercise in math to see how much you shrink if you assume some distributions around the overall ypc for RBs in the league and around the ypc of an RB given his average ypc.
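
The coin-flipper case is easy to simulate. The sketch below uses the variant of James-Stein that shrinks toward the grand mean rather than toward zero, and it treats each flipper's head fraction as roughly Gaussian with known noise variance 0.25 / 10. Both are simplifying assumptions of mine, and the function names are just illustrative.

```python
import random


def shrink_toward_grand_mean(observed, noise_var):
    """Shrink each observed average toward the grand mean, assuming every
    average carries the same known noise variance."""
    p = len(observed)
    if p < 4:
        raise ValueError("shrinking toward the grand mean needs at least 4 values")
    grand_mean = sum(observed) / p
    spread = sum((x - grand_mean) ** 2 for x in observed)
    if spread == 0.0:
        # All observations identical: nothing to shrink.
        return [grand_mean] * p
    # Close to 0 when the spread is about what pure noise would produce,
    # close to 1 when the spread is much larger than the noise.
    factor = 1.0 - (p - 3) * noise_var / spread
    return [grand_mean + factor * (x - grand_mean) for x in observed]


def simulate_flippers(n_flippers=1000, n_flips=10, seed=0):
    """Return (raw_mse, shrunk_mse) for flippers who all use a fair coin."""
    rng = random.Random(seed)
    true_rate = 0.5
    observed = [
        sum(rng.random() < true_rate for _ in range(n_flips)) / n_flips
        for _ in range(n_flippers)
    ]
    # Variance of one flipper's head fraction under a fair coin.
    noise_var = true_rate * (1.0 - true_rate) / n_flips
    shrunk = shrink_toward_grand_mean(observed, noise_var)
    raw_mse = sum((x - true_rate) ** 2 for x in observed) / n_flippers
    shrunk_mse = sum((x - true_rate) ** 2 for x in shrunk) / n_flippers
    return raw_mse, shrunk_mse


if __name__ == "__main__":
    raw, shrunk = simulate_flippers()
    print(f"MSE of raw head fractions:    {raw:.5f}")
    print(f"MSE of shrunk head fractions: {shrunk:.5f}")
```

With real skill differences (the running backs), the spread of the observed averages exceeds what noise alone would produce, so the factor lands somewhere between 0 and 1 and you only shrink part of the way.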

If you want some further intuition, think of the Sports Illustrated curse. It was observed that NFL players who make it to the cover of the SI magazine are generally "cursed", i.e. they don't do as well after as they did before. One amusing case is the (former) New England Patriot Jonas Gray, who made the cover of SI after a phenomenal game with the Indianapolis Colts in 2014 (201 rushing yards, 4 touchdowns), but then he showed up late to work and was promptly benched for the rest of the season. Generally though, players don't do anything stupid like that, but simply "regress to the mean". That regression to the mean is what explains the shrinkage estimator, and the Stein paradox.

Well, this does make a little bit of sense if, after the joint estimation, things get worse for a particular case.

If the MSEs were 5, 5, 100 before, and are 10, 10, 80 after, the total got better but the predictions for the first two got worse.

This is not why it works. The James-Stein estimator applies in the case of independently distributed Normal variables with equal variance, so the individually optimal estimators for each parameter have the same MSE.
I think it is impossible for the James-Stein estimator of the whole sample to outperform all three of the individual MLEs, one for each particular parameter. Not that I would know of a way to extract an estimate for one particular parameter from a joint estimator.
I think I may have misinterpreted your original comment. It's true, as you say, that in the Stein estimator some individual MSEs get worse to yield a lower overall MSE. I was focusing on your example and assuming you meant this was only possible because some individual MSEs were larger than others to begin with (under the individually optimal estimator).
Yeah, reading through the Wikipedia, it looks like this reduces the total error of the combined estimator, but the error compared to an estimator of any one single parameter could be worse. So you can combine whatever crazy parameters you want, but it's only really relevant when you have things that are associated with each other somehow, and you want to reduce the total error of estimating all of them.
> it's only really relevant when you have things that are associated with each other somehow

The proof of lower overall MSE assumes the variables are independent.
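
To see both points from this subthread at once (independent variables, lower total error, but a worse error for one individual parameter), here is a rough simulation. It assumes unit variance and shrinkage toward zero; the true means (nine zeros and one 10) are made up to exaggerate the effect, and `component_risks` is just an illustrative name.

```python
import random


def component_risks(theta, trials=20000, seed=0):
    """Per-coordinate mean squared error of the raw observations and of
    James-Stein (shrinking toward zero), assuming unit variance."""
    rng = random.Random(seed)
    p = len(theta)
    mle_err = [0.0] * p
    js_err = [0.0] * p
    for _ in range(trials):
        # Independent draws: no coordinate carries information about another.
        x = [rng.gauss(t, 1.0) for t in theta]
        shrink = 1.0 - (p - 2) / sum(v * v for v in x)
        for i in range(p):
            mle_err[i] += (x[i] - theta[i]) ** 2
            js_err[i] += (shrink * x[i] - theta[i]) ** 2
    return [e / trials for e in mle_err], [e / trials for e in js_err]


if __name__ == "__main__":
    theta = [0.0] * 9 + [10.0]  # hypothetical: one mean far from the rest
    mle, js = component_risks(theta)
    print(f"total MSE: raw {sum(mle):.3f}, James-Stein {sum(js):.3f}")
    print(f"outlier coordinate MSE: raw {mle[-1]:.3f}, James-Stein {js[-1]:.3f}")
```

The coordinate with the large true mean pays for the improvement in the other nine, which is the trade-off described above.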