by justk 709 days ago
The math is correct, but I think the model is not, since it doesn't reflect that the variable s is dichotomous; a mixed model should be used instead. If we keep treating s as continuous, we could think of this example: s = state is encoded as a continuous variable between -1 and 1, and people change state frequently. Here s = -1 means the person will vote in the blue state with probability 1, s = 1 means the person will vote in the red state with probability 1, and s = 0 means the person is equally likely to vote in the red or the blue state. When s is near zero, the model cannot predict the voter's preferences, and this is the reason for its low predictive power with a continuous s. The extreme cases s = -1 or s = 1 could be rare in populations that move between states frequently, so the initial intuition is misled into this paradox.

This.

R² is not the correct measure to use.

This article is a perfect example of the principle that simply doing math and getting results is not necessarily meaningful.

R² is a measure like any other. In this case it measures the relative reduction in MSE - which is low because the prediction of individual votes remains quite bad even if the state is taken into account.

Does another measure give substantially different results?

I think that you are using here a different definition of R^2. For example, the way you are thinking of R^2 doesn't allow for an interpretation of the constant term used in the linear model, which the formula for R^2 needs in order to hold. What you have in mind is R^2 = 1 - mean(the variance in each state)/(total variance), but that is not the definition of R^2 for a linear model.

As the user fskfsk.... says in another comment, here the constant term explains a lot of the variance, so the slope term contains less information. That is not captured by your definition or idea of R^2.

> I think that you are using here a different definition of R^2

Different from what?

According to Wikipedia:

The most general definition of the coefficient of determination is R^2 = 1 - SS_res / SS_tot ( = 1 - 0.2475 / 0.25 = 0.01 in this case)

Edit to clarify the definition above:

SS_res is the sum of squares of residuals (also called the residual sum of squares) ∑( y_i - predicted_i )^2

SS_tot is the total sum of squares (proportional to the variance of the data) ∑( y_i - ∑y_i/N )^2
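
To make the definition concrete, here is a minimal Python sketch that computes SS_res, SS_tot and R^2 by hand. It assumes the same toy data as the R snippet further down the thread (20 voters per state, 45% versus 55% for one candidate) and predicts each voter with the mean of their state. The variable names are illustrative only.

```python
# Assumed toy data: 20 voters per state, 9 of 20 vs 11 of 20 voting "1".
votes = {0: [0] * 11 + [1] * 9, 1: [0] * 9 + [1] * 11}


def mean(xs):
    return sum(xs) / len(xs)


all_votes = votes[0] + votes[1]
n = len(all_votes)
grand_mean = mean(all_votes)

# SS_res: squared errors when each voter is predicted by their state's mean.
ss_res = sum((y - mean(ys)) ** 2 for ys in votes.values() for y in ys)

# SS_tot: squared deviations from the overall mean.
ss_tot = sum((y - grand_mean) ** 2 for y in all_votes)

r_squared = 1 - ss_res / ss_tot
print(round(ss_res / n, 4), round(ss_tot / n, 4), round(r_squared, 4))
```

Dividing both sums by N gives the per-voter values 0.2475 and 0.25 quoted above, so R^2 comes out as 0.01.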

The most general definition of R^2 can produce a negative result, and we are talking about a paradox concerning the values of R^2 one should expect, so it is common to restrict attention to linear models and linear regression. I don't know whether the variance of the total population can be computed as the sum of the variances in each state, and state is not a continuous variable.

The population variance is the sum of the between-group variance and the within-group variance, each weighted by the number of elements in each group.
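
A quick way to check this decomposition is to compute the pieces directly. The sketch below is only an illustration: it assumes the same toy data as the R snippet in this thread (20 voters per state, 45% versus 55%) and uses Python's standard `statistics` module. The names `within`, `between` and `total` are made up for the example.

```python
from statistics import mean, pvariance

# Assumed toy data: 20 voters per state, 9 of 20 vs 11 of 20 voting "1".
votes = {0: [0] * 11 + [1] * 9, 1: [0] * 9 + [1] * 11}

all_votes = votes[0] + votes[1]
n = len(all_votes)
grand_mean = mean(all_votes)

# Within-group variance: each state's variance, weighted by its share of voters.
within = sum(len(ys) / n * pvariance(ys) for ys in votes.values())

# Between-group variance: spread of the state means around the overall mean.
between = sum(len(ys) / n * (mean(ys) - grand_mean) ** 2 for ys in votes.values())

total = pvariance(all_votes)
print(round(within, 4), round(between, 4), round(total, 4))
```

Here the within-group part is 0.2475 and the between-group part is 0.0025, which add up to the total variance of 0.25. The between-group share, 0.0025 / 0.25, is the same 1% as the R² under discussion.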

I don't understand what you mean. I'll just note that the value of R² in this case is 1% as the blog post explains and the code below confirms.

  > data <- data.frame(state = rep(c(0, 1), each=20), pref = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))
  > summary(lm(pref ~ state, data = data))$r.squared
  [1] 0.01

A mixed model is not relevant here. A simple linear regression with one variable will achieve exactly the same results. Coding the state as -1 and 1 makes no difference compared with coding it as 0 and 1. You just stuff the rest into the intercept.
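
The point about coding can be checked with a hand-rolled one-variable least-squares fit. This is a sketch in plain Python rather than the R used above, on the same toy data; `fit_ols` is a hypothetical helper written for this example, not a library function.

```python
def fit_ols(x, y):
    """Fit y = intercept + slope * x by least squares; return (slope, intercept, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Same toy data as the R snippet: 20 voters per state.
pref = [0] * 11 + [1] * 9 + [0] * 9 + [1] * 11
state_01 = [0] * 20 + [1] * 20    # states coded 0 / 1
state_pm = [-1] * 20 + [1] * 20   # states coded -1 / 1

for coding in (state_01, state_pm):
    slope, intercept, r2 = fit_ols(coding, pref)
    print(round(slope, 4), round(intercept, 4), round(r2, 4))
```

The slope and intercept change with the coding (0.1 and 0.45 versus 0.05 and 0.5), but the fitted values per state, and therefore R², stay the same at 0.01.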

You would also want to be predicting 0.45 and 0.55, not 1 and 0, because we solve for squared error.
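
To see why, compare the two kinds of prediction under squared error. The sketch below assumes the same toy data as the R snippet above; `r_squared` is an illustrative helper that applies the general 1 - SS_res / SS_tot definition to arbitrary predictions.

```python
def r_squared(y, predicted):
    """General definition: 1 - SS_res / SS_tot for arbitrary predictions."""
    my = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, predicted))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1 - ss_res / ss_tot


# Same toy data: 20 voters per state, 45% vs 55% voting "1".
pref = [0] * 11 + [1] * 9 + [0] * 9 + [1] * 11
state = [0] * 20 + [1] * 20

# "Soft" predictions: the share of "1" votes in each state.
soft = [0.45 if s == 0 else 0.55 for s in state]

# "Hard" predictions: the majority vote of each state.
hard = [0 if s == 0 else 1 for s in state]

print(round(r_squared(pref, soft), 4), round(r_squared(pref, hard), 4))
```

Predicting the state shares gives R² = 0.01, while predicting the majority vote gives R² = -0.8: the hard guesses are wrong for 45% of voters, which is worse in squared error than always predicting 0.5.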