by thanatropism 2618 days ago
> The SAT when combined with the high school GPA (HSGPA) has an adjusted correlation coefficient of 0.56 with first-year GPA, meaning the combined measurement accurately predicts how a potential college applicant will perform in their first year of college 56% of the time.

Thaaat's not what "correlation" means.


I'm summarizing for a general audience. I could say r is "the strength of the linear relationship between two variables on a graph," but I'm not sure that helps the average person understand the connection.

If you have a better description, it's more helpful to chime in with that instead of "You're wrong!"

A better summary would be that those two quantities explain about half of the variation, not that they predict accurately half the time.

If you took a random sample of cases, it's not that half of them would exhibit a direct relationship b/w SAT and first year GPA and the other half none (unless the data is _super_ weird). Instead, SAT would be instructive-ish in predicting first year GPA for all those cases.

Explaining half the variation, and the other half?

The point was to draw a connection for the general audience, not present the most scientifically accurate description of a relationship between two variables -- that's what the links to the research are for.

It's good to communicate for a general audience, but your presentation misleads rather than simplifies.

> meaning the combined measurement accurately predicts how a potential college applicant will perform in their first year of college 56% of the time.

"accurately predicts...56% of the time" implies that half of predictions are 'accurate', which most readers would interpret as 'correct', i.e. knowing SAT + HSGPA allows you to state FYGPA _exactly_ for about half of cases. That's not what the research you cited says. Rather, the square of the multiple correlation R (which is exactly R^2, the coefficient of determination) indicates how much of the variance in the output variable is explained by the input variables. That quantity _must_ be communicated in terms of the strength of the relationship, not accuracy for a given case or for a share of cases, as it doesn't tell us anything about any particular case. One could say it tells us about 30% (0.56^2, correction from my statement above) of the information we'd need to know to perfectly predict the outcome, or that the relationship is better than random, but doesn't predict perfectly, or ...
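To make the R^2 point concrete, here is a rough sketch on synthetic data (not the data from the linked research). It assumes two standardized, normally distributed variables built to have a correlation of about 0.56; the names `simulate`, `explained_share` and `close_share`, and the 0.1 standard deviation cutoff for a "close" prediction, are my own choices for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(rho=0.56, n=20_000, seed=0):
    """Synthetic (x, y) pairs with population correlation rho (assumed model)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    noise_scale = math.sqrt(1 - rho ** 2)
    ys = [rho * x + noise_scale * rng.gauss(0, 1) for x in xs]
    return xs, ys


xs, ys = simulate()
n = len(xs)
r = pearson_r(xs, ys)

# Fit the least-squares line y ~ intercept + slope * x.
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Share of the variance in y accounted for by the line; equals r squared.
total_ss = sum((y - mean_y) ** 2 for y in ys)
explained_share = 1 - sum(e * e for e in residuals) / total_ss

# Share of cases where the prediction lands within 0.1 SD of the actual value.
close_share = sum(abs(e) < 0.1 for e in residuals) / n

print(f"r = {r:.3f}, r^2 = {r * r:.3f}, explained share = {explained_share:.3f}")
print(f"share of predictions within 0.1 SD: {close_share:.3f}")
```

Under these assumptions the explained share of variance matches r^2 (roughly 0.31), while the share of "close" predictions depends entirely on the cutoff you pick and has no reason to come out at 56%.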

Additionally, table 5 of the link you cited indicates the adjusted correlation coefficient b/w FYGPA and the combination of HSGPA and SAT is 0.62. None of the numbers in that table are 0.56, so I'm not sure where you pulled that exact number from. I've used 0.56/56% above to be clear which quantity I'm referring to.

Uh...this is exactly what I mean. Your description is 100% scientifically accurate but probably way beyond the average reader.

Again, if you can simplify this in a way more accurate than I have, great, be my guest-- I look forward to reading it.

R^2, coefficient of determination, output variable variance, etc etc -- most readers aren't going to go that deep in the math. For those who do, like you, the links to the actual research are provided.

But so far all I see are data scientists complaining about how my description is not 100% statistically accurate without providing any alternative explanation that doesn't devolve into variance of output variables.

Again, be my guest to show me I'm wrong, but what you wrote above is not something that would be easy to understand for the general audience, IMHO.

The concerning (mis)interpretation of your statement is what I said on the third line above:

> "accurately predicts...56% of the time" implies that half of predictions are 'accurate', which most readers would interpret as 'correct' i.e. knowing SAT + HSGPA allows you to state FYGPA _exactly_ for about half of cases.

This interpretation is easy to arrive at, and clearly does not correspond to a reasonable understanding of the source, even for a general audience.

I provide two suggestions above:

> One could say it tells us about 30% of the information we'd need to know to perfectly predict the outcome, or that the relationship is better than random, but doesn't predict perfectly

No, man.

1) Your description is 0% correct; and 2) It's not a complaint. Maybe you don't know what a correlation coefficient is. That's okay -- I don't know what polyfinite rings are.

That's not summarizing. "It's the strength of the relationship" is summarizing. "The combined measurement accurately predicts how a potential college applicant will perform in their first year of college 56% of the time" is just wrong. See Anscombe's quartet for a great example of why it's just plain wrong.

https://en.wikipedia.org/wiki/Anscombe%27s_quartet
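For anyone who wants to check the quartet rather than take the link's word for it, here is a small sketch that computes the correlation for each of the four datasets. The values are the standard published ones, typed in by hand, so treat the listing as illustrative; `pearson_r` and `correlations` are names I made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Datasets I-III share the same x values; dataset IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}

for name, r in correlations.items():
    print(f"dataset {name}: r = {r:.3f}")
```

All four come out with nearly the same r, even though one is a noisy line, one is a curve, one is a clean line with a single outlier, and one is a vertical stack of points plus one extreme value. The same coefficient can sit on top of very different case-by-case behavior, which is why it can't be read as "accurate X% of the time."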

And your completely scientifically accurate but easy for the lay reader to understand description in a few simple words is...?

Isn't that the example I used?

"It's the strength of the relationship"

I happen to like:

"It's how perfectly you can fit a straight line to them."

You can be mathematically accurate without being mathematically precise. Better imprecise but correct than incorrect but precise.

If you're trying to give a quantitative lay picture of what exactly 0.56 linear correlation means, you still need to be quantitatively right, while the above are qualitative. Pictures and examples can help. "For perspective, 0.56 is about the correlation between <example> and <example>"

I'm sorry, I'm not following your description.

Saying there is a quantitative strength to a relationship is, to a regular person, meaningless. Am I .56 in love with my wife?

Can I fit a straight line to her?

These are not good descriptions. Of course HN is full of data scientists who wildly object to oversimplifying statistical relationships -- luckily you are here to give the detailed mathematical context. But these are not simplified descriptions for a general audience.

> Saying there is a quantitative strength to a relationship is, to a regular person, meaningless.

That the average person would not understand a particular accurate description of a subject does not, in and of itself, make a completely inaccurate alternative description less wrong or even a good simplification.