When correlation is better than causation | HN Mirror


	When correlation is better than causation (narrator.ai)
	70 points by mattjstar 1764 days ago

8 comments

wahern 1763 days ago

https://en.wikipedia.org/wiki/Abductive_reasoning

> [Abductive reasoning] starts with an observation or set of observations and then seeks the simplest and most likely conclusion from the observations. This process, unlike deductive reasoning, yields a plausible conclusion but does not positively verify it. Abductive conclusions are thus qualified as having a remnant of uncertainty or doubt, which is expressed in retreat terms such as "best available" or "most likely". One can understand abductive reasoning as inference to the best explanation.

> In the 1990s, as computing power grew, the fields of law, computer science, and artificial intelligence research spurred renewed interest in the subject of abduction.

Abductive reasoning is basically how one would formally describe 1) the practice of medicine, including diagnosis, 2) the rules for evidence in legal trials, 3) the process for generating hypotheses in science, and innumerable similar activities we undertake daily.

And for obvious reasons there's a close relationship between abductive reasoning and Bayesian statistics.

bmc7505 1763 days ago

There are some good papers which take inspiration from abductive reasoning to study automatic knowledge base construction and structured inference here in case anyone is interested: https://www.cs.utexas.edu/~ml/publications/area/65/abduction

yobbo 1763 days ago

If you want to talk about bayes, you could think about causation as a prior for correlation. We can measure correlation (variates), and then we can infer something about the causation (independent variables).

"Inference to the best explanation" could mean we accept any explanation regardless of how improbable it is - as long as it best explains the data.

The bayesian idea is that we can learn something about causation if we accept uncertainty and impose "sanity constraints" (priors) on the explanation.

Without knowing the real-world mechanics of Y, we can say something like "setting X to 0.33 will increase Y, with 60% probability." It may be impossible to learn anything else from the data.

ahmedelsama 1763 days ago

I like this a lot. I studied Bayesian mathematics, and in my opinion it’s the best approach to solving these problems. Start with a prior and continue to update your state with measurements. This avoids a lot of the common pitfalls of doing batch ML and getting junk results.
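
A minimal sketch of the "start with a prior and keep updating" idea, assuming a yes/no measurement (say, whether a visitor converted) and a uniform Beta(1, 1) prior. The `BetaPosterior` class and the observations are hypothetical, chosen only for illustration:

```python
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    """Belief about a success rate, stored as Beta(alpha, beta) pseudo-counts."""
    alpha: float = 1.0  # assumed prior: one pseudo-success
    beta: float = 1.0   # assumed prior: one pseudo-failure

    def update(self, success: bool) -> None:
        # Conjugate update: each measurement adds one count to one side.
        if success:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


# Made-up stream of measurements, processed one at a time (no batch refit).
observations = [True, False, True, True, False, True, True, True]

posterior = BetaPosterior()
for obs in observations:
    posterior.update(obs)
    print(f"alpha={posterior.alpha:.0f} beta={posterior.beta:.0f} mean={posterior.mean:.3f}")
```

The estimate is usable after every single measurement, and the prior keeps the first few data points from swinging it to 0 or 1.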

Leftium 1763 days ago

> The reality is that causality is very difficult to prove. Not only does it require a higher level of statistical rigor, it also requires A LOT of carefully collected data. Meaning you will have to wait a long time before you can make any causal claim.

Malcolm Gladwell's similar message: https://www.pushkin.fm/episode/burden-of-proof/

- A correlation between mining and lung cancer was discovered in 1918, but wasn't acted on until 1975.

- There is a correlation between football and suicide/brain damage, but it is not being acted on.

brittanymdavis1 1763 days ago

At the end of the day, a correlation analysis will never replace a causal study. So really un-intuitive relationships would likely have never been uncovered with this approach. But I believe most of the questions we're typically asking - especially in a business context - are super intuitive and, even with a big causal study, they are never that "surprising" - don't you agree?

melling 1763 days ago

That’s not exactly true. I think the NFL has moved past denial.

https://www.espn.com/nfl/story/_/id/22603654/nfl-doctor-says...

https://www.today.com/parents/brett-favre-psa-urges-no-tackl...

Whether the game can ever be made safe is another issue.

whyenot 1763 days ago

The NFL might have moved past denial, but the NCAA and high school sports don't seem to have traveled very far along that path.

It's profoundly sad that institutions of higher learning are promoting activities that they know can cause brain damage and long term disability, just so they can make money and entertain their alumni.

negativesigma 1762 days ago

There’s a lot of evidence pointing to cumulative hits being a larger predictor of CTE than concussions. Unfortunately that’s not a problem the NFL can solve so it’s been swept under the rug.

This raises a broader point about Collinearity and whether correlation is actually actionable when the feedback cycle is long. You could easily be working the problem for 20 years before you ever knew you were wrong.

Mattasher 1763 days ago

If you like this, you may be interested in my episode about Cargo Cults, and the value of letting go of causality completely:

https://mattasher.com/2020/04/29/the-filter-podcast-episode-...

asdfasgasdgasdg 1763 days ago

Tl;Dr: never, but causality is hard to establish much of the time, so sometimes we must do without. To be honest, I don't find this very convincing. Most of the insights seem pretty obvious. Like if you're working from the point of correlating totals across differently sized legs of an experiment, you're starting from a really bad place.

Personally, I'm not quite positive that I buy that causation is that hard to establish in many cases. Don't give up on that idea. One thing I would say is that if you have a strong prior reason to believe that one thing causes another thing, finding that they are strongly correlated can be a useful datum. Mainly the important thing is to understand the limitations of correlation to guide decisionmaking.

Frost1x 1763 days ago

>More often than not, when stakeholders require "causality" to make a decision, it takes way too long so they lose patience and end up making a decision without any data at all.

And therein, I believe, lies the problem.

I think the issue is the pressure for science to produce something constantly, so in today's world, correlation is causality. Whether or not you believe in deterministic laws that govern reality, correlation is often the easiest approach when looking at a difficult problem, and therein lies the rise of many probabilistic and statistical models in the face of difficulty. Not all cases, but a lot of cases. We don't want to continue the hard work of determining definitive causal relations, if they exist, and are content with correlative relations.

As someone who grew up fascinated by science because it was science that sought and provided causal relations, I'm often disappointed about the current world of research. I'm not saying this work is easy by any means, it just seems like we often give up anymore after we pick up the low hanging fruit.

lazyjeff 1763 days ago

I don't think it's necessarily that scientists are avoiding the hard work to show causality. It's that the most interesting causal experiments are often unethical, or the independent variable cannot be hidden like a placebo (so the participants' bias affects the randomization), or it's simply impossible.

I'll use one example from some data I've been looking at, which is whether the covid-19 pandemic has changed how people sleep. To study this using the formal notion of causality requires asking a random 50% of people to sleep as if covid isn't happening. That's obviously both impractical and implausible.

So you can really only look at correlations. But I can show you the correlations, and I bet you will be convinced that the pandemic HAS changed people's sleep. Here are some charts if you want to take a look: https://jeffhuang.com/covid_sleep/ but there are probably several factors that convince you that this is causal.

First, the pattern of sleep pre-covid is very stable, and feels trustworthy because it goes up and down during weekends, and holidays are visible. So the data is visibly sensitive to changes in the environment. Second, nearly every country reacts similarly when the N is separated, so even if there's some large group of people somewhere that are outliers (say, some policy by California that everyone needs to go to bed later), it would only affect the one country they are in, not each country separately in the same way. Finally, the patterns of sleep post-covid are also stable, with similar patterns as pre-covid, but just shifted.

I'm not sure if there are formal ways of representing these concepts, but I feel humans understand these intuitively.

brittanymdavis1 1763 days ago

Right - The point about human intuition is an important one. It's a key part of the approach that the blog talks about. When we start from intuition, we significantly improve our hypotheses about what could be correlated. And because no decision making is ever done in a vacuum, it's suggesting we lean into this and use it to our advantage instead of knocking on correlations all the time.

brittanymdavis1 1763 days ago

Unfortunately this pressure - especially in research communities - has led to a lot of p-hacking.

https://www.pnas.org/content/117/24/13386

It's similar in a business context - the "pressure" of finding a causal result (especially in situations using A/B testing) leads to poor analysis practices in order to find something significant.
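
To put a rough number on why hunting for "something significant" goes wrong, here is a minimal sketch. It assumes the metrics being checked are independent and that none of them has a real effect; the function name and the values 0.05 and 20 are illustrative and not taken from the linked paper:

```python
def chance_of_false_positive(alpha: float, num_tests: int) -> float:
    """Probability that at least one of num_tests independent null tests gives p < alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


# One test at alpha = 0.05 has a 5% false-positive rate by construction.
# Slicing the same experiment into more and more metrics inflates that quickly.
for k in (1, 5, 20):
    print(k, round(chance_of_false_positive(0.05, k), 3))
```

Under these assumptions, checking twenty metrics gives roughly a 64% chance of at least one spurious "win", even though nothing changed.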

jasonwatkinspdx 1763 days ago

I think you'd find Judea Pearl's work interesting.

asdfasgasdgasdg 1763 days ago

To be honest my understanding of math and statistics is far too basic for me to really approach this guy's work, in all likelihood. I have read some papers in this space, like the original TrueSkill paper[1], and I found them utterly impenetrable. I'm sure with sufficient practice I could learn but there are so many things to spend time on. I love the concepts and I do think that they are fascinating tools for modeling reality. I'm glad other people are developing them.

[1]: https://www.microsoft.com/en-us/research/wp-content/uploads/...

shoo 1763 days ago

A pathway to understanding TrueSkill could be first getting an understanding of factor graphs and Minka's expectation propagation algorithm [1][2]. I only have a superficial understanding of those things, but I think I've got an understanding of things that are close enough to give some idea of a pathway:

To understand Minka's expectation propagation algorithm you might first need to get a little intuition about assumed density filtering. One way to understand assumed density filtering could be to read a few tutorials about hidden Markov models [3] or Kalman filters and try to get a feel for why and when and how people might want to approximate posterior probability distributions. It might be hard to build enough intuition without trying to actually apply the things (implement the algorithms) or prove the theory yourself, and then try to come up with your own ideas for how to improve the algorithms.

I completely agree that there are basically infinitely more things to learn than available lifetime. It helps a lot to have a concrete application or goal in mind: then you can focus on learning the tools and theory that move you closer to the goal, rather than learning bits and pieces of unrelated knowledge that don't connect together in a useful way.

[1] https://tminka.github.io/papers/ep/roadmap.html

[2] Minka's EP slide deck from his PhD defense https://tminka.github.io/papers/ep/defense.pdf

[3] Rabiner wrote a famous HMM tutorial https://courses.physics.illinois.edu/ece417/fa2017/rabiner89...

kqr 1763 days ago

I can almost guarantee you that if you're on Hacker News, you have the prerequisites needed for Judea Pearl.

I read Causality which I understand is the more technical of their books, and everything was presented surprisingly intuitively. Sure, I had to go over some things twice, but that's to be expected when you learn something new.

If you're worried, start with one of the more pop-aimed books? You'll be fine.

(Pearl did change the way I look at causality and correlation, fundamentally for the better, so I do strongly recommend getting familiar with it. I also liked Willful Ignorance, which is sort of on the same theme but also not, and takes a wider approach.)

jasonwatkinspdx 1763 days ago

I agree with this. Causality doesn't have much in the way of advanced math or statistics, but there is a fairly large volume of material to go through, which is its own challenge.

A lightweight introduction to Pearl's ideas is the epilog of his book, which is also his Turing award lecture. Here's a pdf scan, there's also video of him giving this lecture up on the internet if you prefer: http://bayes.cs.ucla.edu/BOOK-2K/causality2-epilogue.pdf

shoo 1763 days ago

here's an arbitrary review of Pearl's recent book, the book of why: https://www.ams.org/journals/notices/201907/rnoti-p1093.pdf

ahmedelsama 1763 days ago

I took the blog to mean something a little different. You should pursue causality, but not everything is worth the work, so start with a faster, easier correlation analysis; then, if it seems worth it, you can test it to see if it behaves as expected.

Working in data, especially in a startup, we often need to make so many decisions. Trying to change the culture is good, but when there’s a fire, this approach would get us the furthest.

dbt00 1763 days ago

starting with correlation and asserting causation is bad, starting with causation and using correlation as weak evidence is good.

asdfasgasdgasdg 1763 days ago

Precisely (or to be really specific, starting with a suspicion of causation and using correlation as one piece of weak evidence).

ahmedelsama 1763 days ago

I think the ideal is to start with something that you think is causal, then do the discussed correlation analysis. Then, if it looks fruitful, start testing by changing the distribution of the feature; if it is causal, you should see the KPI move in the same direction.

Does that make sense? Smaller steps to make sure you only invest if it is worth it.

brittanymdavis1 1763 days ago

> Personally, I'm not quite positive that I buy that causation is that hard to establish in many cases.

Do you mean it's not hard from an analytical point of view / from a practical data gathering perspective - or both?

I've found that needing causation leads to big delays in backlogs, especially when it's required for every insight, but I'm curious if you've seen it to be different.

BalinKing 1763 days ago

I sort of feel like this: correlation is not causality, but they are correlated nonetheless.

brittanymdavis1 1763 days ago

Yes right - and we can actually put that information to some use in some practical way. At least that's the goal.

mgh2 1762 days ago

Kinds of reasoning - in order of accuracy (AID):

1. Abductive: https://en.wikipedia.org/wiki/Abductive_reasoning ex: Hypothesis

2. Inductive: https://en.wikipedia.org/wiki/Inductive_reasoning ex: Generalizations

3. Deductive: https://en.wikipedia.org/wiki/Deductive_reasoning ex: Studies

galaxyLogic 1763 days ago

I think it means that there is a high chance that causation also means correlation. You see a lot of people dying after they refuse to vaccinate themselves. Can we be sure of the causation? Maybe not 100% but better safe than sorry. More often than not.

bsder 1763 days ago

Isn't the problem here that things are backward?

You don't prove causation, but you can disprove it when you find absence of correlation.

Observed correlation suggests causation which allows you to make a prediction. A prediction can be tested. The prediction will either be true or false based upon whether the correlation continues to hold.

This is one of the problems with A/B tests: they often don't give you causation, aka "Why?". "This dialog box was rearranged and gave us 15% better conversion." Um. Okay. But "Why?" If you can't answer "Why?" you don't have causation.

"We removed needing to enter a phone number and now have 15% better conversion." "Why?" is obvious in that case.

led76 1763 days ago

I don’t think this post is trying to use correlation to prove causation. It’s in effect saying that when you can’t be sure that there is a causal relationship between two things that you can still make some decisions.

Perfect is the enemy of good as they say.

ahmedelsama 1763 days ago

Yeah I love this! Well said

asdfasgasdgasdg 1763 days ago

A/B tests can tell you causation. If they are correctly randomized and the tested change doesn't induce any other unintended changes, then they can tell you that the change causes whatever are the observed changes in your target metric. The challenge is controlling all other factors.

You're correct that they can't tell you the root cause of why your change causes a particular difference, but that's a separate issue from correlation and causation.

Kranar 1763 days ago

You can neither use correlation to prove causation nor can you use causation to prove correlation. X can cause Y but be uncorrelated to Y, and X can be correlated to Y without being caused by Y.
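
A small sketch of the first half of that claim, assuming a toy relationship Y = X² over values of X symmetric around zero (the data and the `pearson` helper are made up for illustration). Y is completely determined by X, yet the linear correlation comes out as zero:

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # Y is caused entirely by X

r = pearson(xs, ys)
print(r)
```

The positive and negative halves cancel exactly, so a correlation-only analysis would report no relationship here.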

brittanymdavis1 1763 days ago

You can use causation to make a decision… and you can use correlation to make a decision. Each with different implications of course - but depending on the context both can be used to do the same task.

elcomet 1763 days ago

You can prove causation with an experimental study. You take a random population, split it in half and modify something for the first half, then look at the effects.
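
A minimal simulation of that recipe, with everything in it assumed for illustration: a made-up population whose baseline outcome varies from person to person, a true effect of 2.0 that we pretend not to know, and a hypothetical `run_experiment` helper that assigns people to the two halves at random:

```python
import random
from statistics import mean


def run_experiment(n: int = 10_000, true_effect: float = 2.0, seed: int = 0) -> float:
    """Randomly split a simulated population in half and return the
    difference in mean outcome between the treated and control halves."""
    rng = random.Random(seed)

    # Each person has a baseline outcome that we neither control nor observe.
    baselines = [rng.gauss(10.0, 2.0) for _ in range(n)]

    # Random assignment: shuffle the indices and treat the first half.
    indices = list(range(n))
    rng.shuffle(indices)
    treated = set(indices[: n // 2])

    treated_outcomes = [b + true_effect for i, b in enumerate(baselines) if i in treated]
    control_outcomes = [b for i, b in enumerate(baselines) if i not in treated]
    return mean(treated_outcomes) - mean(control_outcomes)


estimate = run_experiment()
print(f"estimated effect: {estimate:.2f}")
```

Because the assignment is random, the baseline differences average out across the two halves, so the gap between the group means should land close to the true effect.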