| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by fjkdlsjflkds 756 days ago

> The R^2 used with a linear model requires a constant term, in this case the constant term or bias explains a lot about preferences (almost 50/50) so there is less information available for the slope term.

This explains the paradox, basically. When you take the null model "preference = 50%" (i.e. intercept-only model), there simply isn't much residual variance left for the linear model to explain.

That's why you get an R^2 = 1 if you use the "R^2 = rho(state, preference)^2" formula (you are ignoring the role of the intercept in explaining most of the variance, and exploiting the translation-invariance of the Pearson correlation) vs. you getting an R^2 = 0.01 when you use the (more correct) "R^2 = explained variance / total variance" formula.

TL;DR: It makes sense to get a very low R^2 when it is the intercept and not the predictor that is explaining most of the variance.

1 comments

kgwgk 756 days ago

> there simply isn't much residual variance left for the linear model to explain.

I'd say that there is still quite a lot of residual variance to explain. You need a baseline - the worst choice would be to predict 0 (or 1) and the mean squared error would be 0.5. Using 0.5 as baseline halves the mean squared error to 0.25.

link

fjkdlsjflkds 756 days ago

This, of course, will depend on how you code your variables, but if you try to fit a null, intercept-only and predictor-only model, you get this as residual variance:

> data <- data.frame(state = c(0, 1), pref = c(0.45, 0.55))

> sum(residuals(lm(pref ~ 0, data = data))^2) # null model

[1] 0.505

> sum(residuals(lm(pref ~ 1, data = data))^2) # intercept-only model

[1] 0.005

> sum(residuals(lm(pref ~ state + 0, data = data))^2) # predictor-only model

[1] 0.2025

So, it seems clear that you only get a "perfect" prediction with the full (intercept + predictor) model mostly because of the intercept (which explains (0.505-0.005)/0.505 = 0.99 = 99% of the variance).

Thus, it makes sense that the predictor is only explaining the rest (i.e. 1%) of the variance... hence, the R^2 = 0.01

link

justk 756 days ago

This is a little strange, you are using a data.frame with only two points so any linear model with two different parameters will be 100% accurate. This is the line that connect two points.

link

fjkdlsjflkds 756 days ago

An affine model, yes (as I mentioned in my comment). A linear model, no ;)

But, anyway, seems like I interpreted things incorrectly.

link

kgwgk 756 days ago

Your calculation is not directly related to the model (and associated R²) discussed in the article which are about the prediction of individual votes using the state as predictor - not state averages using the state as predictor.

Maybe I'm completely missing your point but the calculations in the blog post are, adapting your code (I think you meant mean where you wrote sum):

  > data <- data.frame(state = rep(c(0, 1), each=20), pref = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))

  > mean(residuals(lm(pref ~ 0, data = data))^2) # null model [NOT IN THE BLOG POST]
  [1] 0.5

  > mean(residuals(lm(pref ~ 1, data = data))^2) # BASELINE intercept-only model
  [1] 0.25

  > mean(residuals(lm(pref ~ state + 0, data = data))^2) # predictor-only model [NOT IN THE BLOG POST]
  [1] 0.34875

  > mean(residuals(lm(pref ~ state, data = data))^2) # MODEL
  [1] 0.2475

  > summary(lm(pref ~ state, data = data))$r.squared # MODEL
  0.01

The blog post is about what you call "intercept-only" model (MSE 0.25) and the full model (MSE 0.2475), the R² is (0.25-0.2475)/0.25=0.01. His calculation is slightly different: instead of 0.25-0.2475 he calculates directly 0.05^2 which is the variance of the predictions (in this case the total variance 0.25 can be decomposed as the variance of the errors 0.2475 plus the variance of the predictions 0.0025).

link

fjkdlsjflkds 756 days ago

(After re-reading the blog post with more care...) you are right, and thanks for the correction.

Either way, the point stands... the improvement in using a full linear model (that predicts 0.45 or 0.55, depending on state) is marginal compared to the baseline model that always predicts 0.50, as you demonstrate with your code.

To me, this doesn't seem paradoxical... the predictor is indeed providing little information over the "let's flip a coin to predict someone's voting preference" null/baseline predictor, since people's preferences (in aggregate) are almost equivalent to "flipping a coin".

note: I meant "sum", but it's the same, since the ratio between sums of squares is equivalent to the ratio between mean squares

link

kgwgk 756 days ago

> Either way, the point stands... the improvement in using a full linear model (that predicts 0.45 or 0.55, depending on state) is marginal compared to the baseline model that always predicts 0.50

Yes, I think we don't disagree. I was just puzzled by the "little variance left to explain" remark.

> note: I meant "sum", but it's the same, since the ratio between sums of squares is equivalent to the ratio between mean squares

You're right, sum of squares made sense if it was just for the ratio.

link

fjkdlsjflkds 756 days ago

Thanks for taking the time to clarify my confusion.

It's not that there is "little variance left to explain", but actually that (no matter what) there will always be too much variance left to be explained, when the response is Bernoulli-distributed and the parameter is not too far from 0.5 (i.e., the data generating process is like flipping a slightly loaded coin).

If you use the expected value to predict the Bernoulli variable, you will always be somewhat wrong (0.45 and 0.55 are both far from 0 and from 1, which are the only possible responses).

If you use a binary response to predict, you will quite often be very wrong, even if you are right on average, and even if your prediction is to generate Bernoulli-distributed samples from the exact same distribution (i.e., you know exactly how the coin is loaded/biased and you can exactly replicate its data generation process).

So... yeah... no "paradox" ;)

link