| HN Mirror



	by justk 756 days ago
	The math is correct, but I don't think the model is, because it doesn't reflect that the variable s is dichotomous; a mixed model should be used instead. If we keep treating s as continuous, consider this example: s = state is encoded as a continuous variable between -1 and 1, and people change state frequently. s = -1 means the person will vote in the blue state with probability 1, s = 1 means the person will vote in the red state with probability 1, and s = 0 means the person is equally likely to vote in either state. When s is near zero the model cannot predict the voter's preference, which is why the model has low predictive power for a continuous s. The extreme cases s = -1 and s = 1 could be rare in populations that move between states frequently, so the initial intuition is misled into this paradox.

2 comments

mtts 756 days ago

This.

R² is not the correct measure to use.

This article is a perfect example of the principle that simply doing math and getting results is not necessarily meaningful.


kgwgk 756 days ago

R² is a measure like any other. In this case it measures the relative reduction in MSE - which is low because the prediction of individual votes remains quite bad even if the state is taken into account.

Does another measure give substantially different results?


justk 756 days ago

I think that you are using here a different definition of R^2. For example, the way you are thinking of R^2 doesn't allow for an interpretation of the constant term of the linear model, which is needed for the formula of R^2 to hold. What you have in mind is R^2 = 1 - mean(the variance in each state)/(total variance), but that is not the definition of R^2 for a linear model.

As the user fskfsk.... says in another comment, the constant term here explains a lot of the variance, so the slope term contains less information, and that is not available under your definition or idea of R^2.


kgwgk 756 days ago

> I think that you are using here a different definition of R^2

Different from what?

According to Wikipedia:

The most general definition of the coefficient of determination is R^2 = 1 - SS_res / SS_tot ( = 1 - 0.2475 / 0.25 = 0.01 in this case)

Edit to clarify the definition above:

SS_res is the sum of squares of residuals (also called the residual sum of squares) ∑( y_i - predicted_i )^2

SS_tot is the total sum of squares (proportional to the variance of the data) ∑( y_i - ∑y_i/N )^2
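To make the two sums concrete, here is a minimal sketch in plain Python. It assumes a toy data set of 20 voters per state with a 45% / 55% preference split (the group sizes are an assumption for illustration), and it uses each state's mean preference as the prediction:

```python
# Toy data (assumed sizes): 20 voters per state, 45% vs 55% preferring option 1.
state = [0] * 20 + [1] * 20
pref = [0] * 11 + [1] * 9 + [0] * 9 + [1] * 11

def mean(xs):
    return sum(xs) / len(xs)

# Under squared error, the best prediction given the state is that state's mean.
group_mean = {s: mean([p for st, p in zip(state, pref) if st == s]) for s in (0, 1)}
predicted = [group_mean[s] for s in state]

# SS_res: squared residuals around the per-state prediction.
ss_res = sum((y - yhat) ** 2 for y, yhat in zip(pref, predicted))
# SS_tot: squared deviations around the overall mean.
ss_tot = sum((y - mean(pref)) ** 2 for y in pref)

r_squared = 1 - ss_res / ss_tot
print(ss_res / len(pref), ss_tot / len(pref), r_squared)
```

Dividing both sums by N gives roughly 0.2475 and 0.25, so R^2 comes out at about 0.01, matching the figures quoted above.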


justk 756 days ago

The most general definition of R^2 can produce a result that is negative, and we are talking about a paradox related to values of R^2 that one should expect. So it is common to use linear models and linear regression. I don't know if the variance of the total population can be computed as the sum of the variances in each state, and state is not a continuous variable.

The population variance is the sum of the Between Group Variance and the Within Group Variance weighted by the number of elements in each group.
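A quick check of that decomposition, as a sketch in plain Python. The data are an assumed toy set (20 voters per state, 45% / 55% split), and `pvar` is a hypothetical helper for the population variance:

```python
# Toy data (assumed sizes): 20 voters per state, 45% vs 55% preferring option 1.
state = [0] * 20 + [1] * 20
pref = [0] * 11 + [1] * 9 + [0] * 9 + [1] * 11

def mean(xs):
    return sum(xs) / len(xs)

def pvar(xs):
    # Population variance (divide by N, not N - 1).
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

groups = {s: [p for st, p in zip(state, pref) if st == s] for s in (0, 1)}
n = len(pref)

# Within-group variance, weighted by each group's share of the population.
within = sum(len(g) / n * pvar(g) for g in groups.values())
# Between-group variance: spread of the group means around the overall mean.
between = sum(len(g) / n * (mean(g) - mean(pref)) ** 2 for g in groups.values())
total = pvar(pref)

print(total, within, between, between / total)
```

With these numbers the within part is about 0.2475 and the between part about 0.0025, which add up to the total of 0.25. The between share, 0.0025 / 0.25, is the same 1% that the R^2 discussion in this thread arrives at.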


kgwgk 756 days ago

I don't understand what you mean. I'll just note that the value of R² in this case is 1% as the blog post explains and the code below confirms.

  > data <- data.frame(state = rep(c(0, 1), each=20), pref = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))
  > summary(lm(pref ~ state, data = data))$r.squared
  [1] 0.01


jncfhnb 756 days ago

A mixed model is not relevant here. A simple linear regression with one variable will achieve exactly the same results. Coding the state as -1 and 1 makes no difference compared with coding it as 0 and 1; the rest just gets absorbed into the intercept.

You would also want to be predicting 0.45 and 0.55, not 1 and 0, because we are minimizing squared error.
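A minimal sketch of both points in plain Python, using the closed-form least-squares line for a single predictor. The data are an assumed toy set (20 voters per state, 45% / 55% split), and `fit_line` and `fit_and_score` are hypothetical helper names:

```python
# Toy data (assumed sizes): 20 voters per state, 45% vs 55% preferring option 1.
pref = [0] * 11 + [1] * 9 + [0] * 9 + [1] * 11
state_01 = [0] * 20 + [1] * 20    # state coded as 0 / 1
state_pm = [-1] * 20 + [1] * 20   # state coded as -1 / 1

def mean(xs):
    return sum(xs) / len(xs)

def fit_line(x, y):
    # Ordinary least squares with one predictor: slope = cov(x, y) / var(x).
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

def fit_and_score(x, y):
    intercept, slope = fit_line(x, y)
    predicted = [intercept + slope * a for a in x]
    ss_res = sum((b - p) ** 2 for b, p in zip(y, predicted))
    ss_tot = sum((b - mean(y)) ** 2 for b in y)
    return intercept, slope, predicted, 1 - ss_res / ss_tot

b0_01, b1_01, pred_01, r2_01 = fit_and_score(state_01, pref)
b0_pm, b1_pm, pred_pm, r2_pm = fit_and_score(state_pm, pref)

print(b0_01, b1_01, r2_01)   # 0 / 1 coding
print(b0_pm, b1_pm, r2_pm)   # -1 / 1 coding
print(pred_01[0], pred_01[-1], pred_pm[0], pred_pm[-1])
```

The intercept and slope differ between the two codings (about 0.45 and 0.1 versus 0.5 and 0.05), but the fitted values are about 0.45 and 0.55 in both cases, so R^2 is the same.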
