by justk
710 days ago
Let d1 = data[state==0] and d2 = data[state==1]; then var(d1$pref) = 0.26, var(d2$pref) = 0.26 and var(d$pref) = 0.256 (using R and one of your dataframes). The intuition is that knowing the state gives no information about the preferences of the voters, which suggests that any model based on state should give poor results, so having R^2=1 is not a big paradox in this case. There must be a formula to compute R^2 from the variances both among states and inside states. In any case, when the variances inside any state are bigger than the total variance, that should imply that the feature that divides the population into groups is of little value for prediction, so it should have a small R^2 value.
I was replying to someone who claimed that "R2 is not the correct measure to use. This article is a perfect example of the principle that simply doing math and getting results is not necessarily meaningful." I've not seen any comment from anyone getting "different results" with a different measure.
Edit: You used var(...), which includes a factor N/(N-1), so it doesn't correspond exactly to the total sum of squares.
The example dataframe contains 40 observations (20 per state), and you get a higher variance estimate for the subsamples than for the aggregate sample. But if you put together a few copies of the data (for example, doing "data <- rbind(data, data, data, data, data)"), even the adjusted (unbiased) estimator of the variance is lower for the states.
You can calculate the "exact" values yourself doing (x-mean(x))^2 or undoing the adjustment:
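A minimal sketch of both approaches, under stated assumptions: the data frame below is a hypothetical stand-in (not the article's data), built with 9 of 20 voters preferring option 1 in state 0 and 11 of 20 in state 1, because those counts happen to reproduce the 0.26 and 0.256 figures quoted above. The helper names pop_var and pop_var_undo are mine.

```r
# Hypothetical stand-in data (an assumption, not the article's data frame):
# 40 voters, 20 per state; 9/20 prefer option 1 in state 0, 11/20 in state 1.
data <- data.frame(
  state = rep(c(0, 1), each = 20),
  pref  = c(rep(c(1, 0), times = c(9, 11)), rep(c(1, 0), times = c(11, 9)))
)

# "Exact" variance: mean squared deviation, dividing by N instead of N - 1.
pop_var <- function(x) mean((x - mean(x))^2)

# Equivalent: undo the N/(N-1) adjustment that var() applies.
pop_var_undo <- function(x) var(x) * (length(x) - 1) / length(x)

d1 <- data$pref[data$state == 0]
d2 <- data$pref[data$state == 1]

# Adjusted estimates, as returned by var(): about 0.2605, 0.2605, 0.2564
adjusted <- c(var(d1), var(d2), var(data$pref))

# Unadjusted values: 0.2475, 0.2475, 0.2500 (states are now below the total)
unadjusted <- c(pop_var(d1), pop_var(d2), pop_var(data$pref))

# With five stacked copies, even var() is lower within a state than overall.
big <- rbind(data, data, data, data, data)
var_state_big <- var(big$pref[big$state == 0])  # 0.25
var_total_big <- var(big$pref)                  # about 0.2513

# R^2 of a model that predicts each state's mean: 1 - SS_within / SS_total
ss_within <- sum((d1 - mean(d1))^2) + sum((d2 - mean(d2))^2)
ss_total  <- sum((data$pref - mean(data$pref))^2)
r2 <- 1 - ss_within / ss_total                  # about 0.01

print(round(c(adjusted, unadjusted, r2), 4))
```

The unadjusted within-state values never exceed the unadjusted total, which is why the comparison should be made on sums of squares rather than on var() output from samples of different sizes.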
> when the variances inside any state are bigger than the total variance
They are not. But you're right in that a small difference shows that dividing the population in groups is of little value for prediction and that's why the R^2 value is small.
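For reference, here is the standard decomposition behind that statement, written for a model that predicts each state's mean. The notation is mine, not the article's: $x_{gi}$ is observation $i$ in state $g$, $n_g$ the state size, $\bar{x}_g$ the state mean, $\bar{x}$ the overall mean, and $\sigma_g^2$, $\sigma^2$ the unadjusted (divide-by-N) variances within state $g$ and overall.

```latex
\begin{align*}
\underbrace{\sum_{g}\sum_{i=1}^{n_g} (x_{gi}-\bar{x})^2}_{SS_{\text{total}}}
  &= \underbrace{\sum_{g}\sum_{i=1}^{n_g} (x_{gi}-\bar{x}_g)^2}_{SS_{\text{within}}}
   + \underbrace{\sum_{g} n_g\,(\bar{x}_g-\bar{x})^2}_{SS_{\text{between}}} \\[4pt]
R^2 &= \frac{SS_{\text{between}}}{SS_{\text{total}}}
     = 1 - \frac{SS_{\text{within}}}{SS_{\text{total}}}
     = 1 - \frac{\sum_{g} n_g\,\sigma_g^2}{N\,\sigma^2},
\qquad N = \sum_{g} n_g
\end{align*}
```

Since $SS_{\text{between}} \ge 0$, the size-weighted average of the unadjusted within-state variances can never exceed the unadjusted total variance; when the two are nearly equal, $R^2$ is close to zero.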