Hacker News
by mbrubeck 6035 days ago
Regression to the mean does not imply that the first slope should be less than one.

If for some reason only the above-average students regressed, then the slope would be <1. But regression to the mean also affects the scores of students who started below average; as a group we should expect them to regress upward toward the mean. Combine the two groups, and the effects exactly cancel out, leaving a slope of 1.

(Since you say the slope "should be" one, I assume the scores are normalized somehow so that the mean score for exam A is the same as the mean for exam B.)

2 comments

Not only is it true that the first slope doesn't have to be less than one, it could plausibly be greater than one.

Suppose the course material is really cumulative, so that some students "get it" and take off, while other students fall by the wayside. Then scoring well on the first test predicts scoring well on the second midterm, while scoring poorly on the first predicts scoring really badly on the second. Then the slope of your least-squares-fit line could easily be greater than 1.
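To make that concrete, here is a rough simulation sketch of the "cumulative course" scenario. Everything in it is an assumption made up for illustration: the class mean of 70, the `amplification` factor of 1.5, the noise level, and the helper names. Scores are not clipped to a 0-100 range. It is not based on the actual exam data being discussed.

```python
import random


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov_xy / var_x


def simulate_cumulative_course(n_students=20000, amplification=1.5,
                               noise_sd=5.0, seed=0):
    """Hypothetical model: a student's gap from the class mean widens
    by `amplification` on the second exam, plus independent noise."""
    rng = random.Random(seed)
    class_mean = 70.0  # assumed; both exams share this mean by construction
    first = [rng.gauss(class_mean, 10.0) for _ in range(n_students)]
    second = [
        class_mean + amplification * (x - class_mean) + rng.gauss(0.0, noise_sd)
        for x in first
    ]
    return first, second


first, second = simulate_cumulative_course()
slope = least_squares_slope(first, second)
mean_first = sum(first) / len(first)
mean_second = sum(second) / len(second)

print(f"mean of first exam:  {mean_first:.2f}")
print(f"mean of second exam: {mean_second:.2f}")
print(f"least-squares slope: {slope:.3f}")
```

Under these assumptions the two means come out nearly equal while the fitted slope lands near the amplification factor, i.e. above 1.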

In other words, the mean could stay the same, due to above-average students (on the first test) getting better, and below-average students getting worse. There's no reason to suppose that below-average students will magically get better.

Yep, I assumed they were normalized. I don't see the canceling out, though.

If you do poorly on the first test, the x-coordinate is low (close to the y-axis). You're then expected to do better on the second test, so the y-coordinate is high. This will flatten the left half of the line.

As you said, if you do well on the first test, the x-coordinate is high, and the y-coordinate is low. This will flatten the right half of the line.

Right?

I'll assume up front that the scores on the two tests are independent, since that's the scenario in which regression to the mean applies. (It also applies if they are correlated but have some independent "noise" component, but that complicates things.)

Your mistake is here:

"If you do poorly on the first test... you're then expected to do better on the second test, so the y-coordinate is high."

Actually, under my assumption, a student who does poorly on the first test is no more or less likely than anyone else to do well on the second test. Their y-coordinate will not be "high" in absolute terms; on average, it will be the same as the mean for the whole class. The regression exists because the group has a low starting point, not because it has a high ending point. As a group the high scorers will regress to the mean, not past it. (In the case where scores are partially correlated, the group will regress toward the mean.)
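The partially correlated case from the parenthetical can be sketched with a small simulation. The model here is an assumption for illustration only: each score is a shared "ability" term plus independent per-exam noise, with made-up means and spreads and a hypothetical function name.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulate_partially_correlated(n_students=20000, seed=1):
    """Hypothetical model: score = shared ability + independent exam noise."""
    rng = random.Random(seed)
    exam_a, exam_b = [], []
    for _ in range(n_students):
        ability = rng.gauss(70.0, 8.0)                 # shared between exams
        exam_a.append(ability + rng.gauss(0.0, 8.0))   # independent noise, exam A
        exam_b.append(ability + rng.gauss(0.0, 8.0))   # independent noise, exam B
    return exam_a, exam_b


exam_a, exam_b = simulate_partially_correlated()
class_mean_a = mean(exam_a)
class_mean_b = mean(exam_b)

# Students who scored below the class mean on exam A.
low = [(a, b) for a, b in zip(exam_a, exam_b) if a < class_mean_a]
low_mean_a = mean([a for a, _ in low])
low_mean_b = mean([b for _, b in low])

print(f"class mean, exam A:          {class_mean_a:.2f}")
print(f"class mean, exam B:          {class_mean_b:.2f}")
print(f"low group's mean on exam A:  {low_mean_a:.2f}")
print(f"low group's mean on exam B:  {low_mean_b:.2f}")
```

With these assumptions the low group's exam B mean sits between its exam A mean and the class mean: the group moves toward the mean without reaching or passing it.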

For example, suppose scores are independently, uniformly distributed on both tests. Then your scatterplot will have dots distributed uniformly over its entire area - obviously this does not change if you switch axes. And yet there is regression to the mean. Divide the graph into quadrants. If you look at the right half of the plot (high scorers on first exam), you'll see that there are about as many in the upper quadrant as the lower (their mean on the second exam is the class mean). Same for the left side of the plot. This isn't order-dependent; you'll find that the high-scorers on B also regress to the mean when you look at their A scores.
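The quadrant argument is easy to check numerically. This is only a sketch of the independent, uniform case described above; the 0-100 score range, the cutoff at 50, and the sample size are assumptions chosen for illustration.

```python
import random

rng = random.Random(2)
n_students = 20000

# Independent, uniformly distributed scores on both exams (assumed 0-100).
scores = [(rng.uniform(0.0, 100.0), rng.uniform(0.0, 100.0))
          for _ in range(n_students)]

midpoint = 50.0  # the class mean of a uniform 0-100 score

# Right half of the plot: high scorers on exam A.
high_a = [(a, b) for a, b in scores if a > midpoint]
upper_right = sum(1 for _, b in high_a if b > midpoint)
lower_right = len(high_a) - upper_right
high_a_mean_b = sum(b for _, b in high_a) / len(high_a)

# Switch axes: high scorers on exam B, looking at their exam A scores.
high_b = [(a, b) for a, b in scores if b > midpoint]
high_b_mean_a = sum(a for a, _ in high_b) / len(high_b)

print(f"high on A: {upper_right} in upper quadrant, {lower_right} in lower")
print(f"high on A, mean on B: {high_a_mean_b:.2f}")
print(f"high on B, mean on A: {high_b_mean_a:.2f}")
```

Both selected groups average close to 50 on the other exam, and the right half splits roughly evenly between the upper and lower quadrants, whichever exam is used to pick the group.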