When correlation is better than causation (narrator.ai)
70 points by mattjstar 1764 days ago
8 comments

https://en.wikipedia.org/wiki/Abductive_reasoning

> [Abductive reasoning] starts with an observation or set of observations and then seeks the simplest and most likely conclusion from the observations. This process, unlike deductive reasoning, yields a plausible conclusion but does not positively verify it. Abductive conclusions are thus qualified as having a remnant of uncertainty or doubt, which is expressed in retreat terms such as "best available" or "most likely". One can understand abductive reasoning as inference to the best explanation.

> In the 1990s, as computing power grew, the fields of law, computer science, and artificial intelligence research spurred renewed interest in the subject of abduction.

Abductive reasoning is basically how one would formally describe 1) the practice of medicine, including diagnosis, 2) the rules for evidence in legal trials, 3) the process for generating hypotheses in science, and innumerable similar activities we undertake daily.

And for obvious reasons there's a close relationship between abductive reasoning and Bayesian statistics.

There are some good papers which take inspiration from abductive reasoning to study automatic knowledge base construction and structured inference here in case anyone is interested: https://www.cs.utexas.edu/~ml/publications/area/65/abduction
If you want to talk about Bayes, you could think about causation as a prior for correlation. We can measure correlation (variates), and then we can infer something about the causation (independent variables).

"Inference to the best explanation" could mean we accept any explanation regardless of how improbable it is - as long as it best explains the data.

The Bayesian idea is that we can learn something about causation if we accept uncertainty and impose "sanity constraints" (priors) on the explanation.

Without knowing the real-world mechanics of Y, we can say something like "setting X to 0.33 will increase Y, with 60% probability." It may be impossible to learn anything else from the data.

I like this a lot. I studied Bayesian mathematics and in my opinion it’s the best approach to solving these problems. Start with a prior and continue to update your state with measurements. This avoids a lot of the common pitfalls when doing batch ML and getting junk results.
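
A minimal sketch of "start with a prior and keep updating" in Python, assuming the simplest possible setting: a Beta prior over an unknown success rate, updated one observation at a time. The `BetaBelief` class, the uniform Beta(1, 1) prior, and the observation list are all made up for illustration and are not from the article.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief about an unknown success rate."""
    alpha: float = 1.0  # prior pseudo-count of successes (assumed uniform prior)
    beta: float = 1.0   # prior pseudo-count of failures

    def update(self, success: bool) -> "BetaBelief":
        # Conjugate update: each observation adds one pseudo-count.
        if success:
            return BetaBelief(self.alpha + 1, self.beta)
        return BetaBelief(self.alpha, self.beta + 1)

    @property
    def mean(self) -> float:
        # Posterior mean of the success rate.
        return self.alpha / (self.alpha + self.beta)


belief = BetaBelief()
observations = [True, False, True, True, False, True, True, True]  # made-up data
for obs in observations:
    belief = belief.update(obs)
    label = "success" if obs else "failure"
    print(f"after {label}: posterior mean = {belief.mean:.3f}")
```

The belief after each measurement becomes the prior for the next one, so there is no separate batch-fitting step.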
> The reality is that causality is very difficult to prove. Not only does it require a higher level of statistical rigor, it also requires A LOT of carefully collected data. Meaning you will have to wait a long time before you can make any causal claim.

Malcolm Gladwell's similar message: https://www.pushkin.fm/episode/burden-of-proof/

- A correlation between mining and lung cancer was discovered in 1918, but wasn't acted on until 1975.

- There is a correlation between football and suicide/brain damage, but it is not being acted on.

At the end of the day, a correlation analysis will never replace a causal study. So really un-intuitive relationships would likely have never been uncovered with this approach. But I believe most of the questions we're typically asking - especially in a business context - are super intuitive and, even with a big causal study, they are never that "surprising" - don't you agree?
That’s not exactly true. I think the NFL has moved past denial.

https://www.espn.com/nfl/story/_/id/22603654/nfl-doctor-says...

https://www.today.com/parents/brett-favre-psa-urges-no-tackl...

Whether the game can ever be made safe is another issue.

The NFL might have moved past denial, but the NCAA and high school sports don't seem to have traveled very far along that path.

It's profoundly sad that institutions of higher learning are promoting activities that they know can cause brain damage and long term disability, just so they can make money and entertain their alumni.

There’s a lot of evidence pointing to cumulative hits being a larger predictor of CTE than concussions. Unfortunately that’s not a problem the NFL can solve so it’s been swept under the rug.

This raises a broader point about collinearity and whether correlation is actually actionable when the feedback cycle is long. You could easily be working the problem for 20 years before you ever knew you were wrong.

If you like this, you may be interested in my episode about Cargo Cults, and the value of letting go of causality completely:

https://mattasher.com/2020/04/29/the-filter-podcast-episode-...

Tl;Dr: never, but causality is hard to establish much of the time, so sometimes we must do without. To be honest, I don't find this very convincing. Most of the insights seem pretty obvious. Like if you're working from the point of correlating totals across differently sized legs of an experiment, you're starting from a really bad place.

Personally, I'm not quite positive that I buy that causation is that hard to establish in many cases. Don't give up on that idea. One thing I would say is that if you have a strong prior reason to believe that one thing causes another, finding that they are strongly correlated can be a useful datum. Mainly, the important thing is to understand the limitations of correlation as a guide to decision-making.

>More often than not, when stakeholders require "causality" to make a decision, it takes way too long so they lose patience and end up making a decision without any data at all.

And therein, I believe, lies the problem.

I think the issue is the pressure for science to produce something constantly, so in today's world, correlation is causality. Whether or not you believe in deterministic laws that govern reality, correlation is often the easiest approach when looking at a difficult problem, and therein lies the rise of many probabilistic and statistical models in the face of difficulty. Not all cases, but a lot of cases. We don't want to continue the hard work of determining definitive causal relations, if they exist, and are content with correlative relations.

As someone who grew up fascinated by science because it was science that sought and provided causal relations, I'm often disappointed about the current world of research. I'm not saying this work is easy by any means, it just seems like we often give up after we pick the low-hanging fruit.

I don't think it's necessarily that scientists are avoiding the hard work to show causality. It's that the most interesting causal experiments are often unethical, or the independent variable cannot be hidden like a placebo (so the participants' bias affects the randomization), or it's simply impossible.

I'll use one example from some data I've been looking at, which is whether the covid-19 pandemic has changed how people sleep. To study this using the formal notion of causality requires asking a random 50% of people to sleep as if covid isn't happening. That's obviously both impractical and implausible.

So you can really only look at correlations. But I can show you the correlations, and I bet you will be convinced that the pandemic HAS changed people's sleep. Here are some charts if you want to take a look: https://jeffhuang.com/covid_sleep/ but there are probably several factors that convince you that this is causal.

First, the pattern of sleep pre-covid is very stable, and feels trustworthy because it goes up and down during weekends, and holidays are visible. So the data is visibly sensitive to changes in the environment. Second, nearly every country reacts similarly when the N is separated, so even if there's some large group of people somewhere that are outliers (say, some policy by California that everyone needs to go to bed later), it would only affect that one country they are in, not each country separately the same way. Finally, the patterns of sleep post-covid are also stable with similar patterns as pre-covid, but just shifted.

I'm not sure if there's formal ways of representing these concepts, but I feel humans understand these intuitively.

Right - The point about human intuition is an important one. It's a key part of the approach that the blog talks about. When we start from intuition, we significantly improve our hypotheses about what could be correlated. And because no decision making is ever done in a vacuum, it's suggesting we lean into this and use it to our advantage instead of knocking on correlations all the time.
Unfortunately this pressure - especially in research communities - has led to a lot of p-hacking.

https://www.pnas.org/content/117/24/13386

It's similar in a business context - the "pressure" of finding a causal result (especially in situations using AB testing) leads to poor analysis practices in order to find something significant.
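
To make the p-hacking risk concrete, here is a rough simulation, assuming an experiment with no real effect at all where the analyst checks 20 independent metrics and reports whichever one clears p < 0.05. The metric count, threshold, and function names are illustrative assumptions. Under the null hypothesis a z statistic is standard normal, so each metric's z score is drawn directly.

```python
import math
import random


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard-normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def any_false_positive_rate(n_metrics: int, n_experiments: int,
                            alpha: float = 0.05, seed: int = 0) -> float:
    """Fraction of null experiments where at least one metric looks significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # No true effect anywhere: every z score is pure noise.
        p_values = [two_sided_p(rng.gauss(0.0, 1.0)) for _ in range(n_metrics)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_experiments


simulated = any_false_positive_rate(n_metrics=20, n_experiments=5000)
analytic = 1 - (1 - 0.05) ** 20  # chance that at least one of 20 null tests passes
print(f"simulated: {simulated:.3f}, analytic: {analytic:.3f}")
```

With 20 metrics, the chance of finding "something significant" is well over half even when nothing is going on, which is why the metric should be chosen before looking at the data.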

I think you'd find Judea Pearl's work interesting.
To be honest my understanding of math and statistics is far too basic for me to really approach this guy's work, in all likelihood. I have read some papers in this space, like the original TrueSkill paper[1], and I found them utterly impenetrable. I'm sure with sufficient practice I could learn but there are so many things to spend time on. I love the concepts and I do think that they are fascinating tools for modeling reality. I'm glad other people are developing them.

[1]: https://www.microsoft.com/en-us/research/wp-content/uploads/...

A pathway to understanding TrueSkill could be first getting an understanding of factor graphs and Minka's expectation propagation algorithm [1][2]. I only have a superficial understanding of those things, but I think I've got an understanding of things that are close enough to give some idea of a pathway:

To understand Minka's expectation propagation algorithm you might first need to get a little intuition about assumed density filtering. One way to understand assumed density filtering could be to read a few tutorials about hidden Markov models [3] or Kalman filters and try to get a feel for why and when and how people might want to approximate posterior probability distributions. It might be hard to build enough intuition without trying to actually apply the things (implement the algorithms) or prove the theory yourself, and then try to come up with your own ideas for how to improve the algorithms.
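
As a starting point for that intuition, here is a minimal one-dimensional Kalman filter sketch, assuming a scalar state observed with Gaussian noise. In this linear-Gaussian case the posterior is exactly Gaussian, so nothing has to be approximated; assumed density filtering is what you reach for when the exact posterior falls outside the chosen family and has to be projected back into it. The function name, variances, and measurements are made up for illustration.

```python
from typing import Tuple


def kalman_step(mean: float, var: float, measurement: float,
                meas_var: float, process_var: float = 0.0) -> Tuple[float, float]:
    """One predict-then-update step for a scalar state."""
    # Predict: the state may drift, so uncertainty grows by the process variance.
    var = var + process_var
    # Update: blend prediction and measurement, weighted by their uncertainties.
    gain = var / (var + meas_var)
    mean = mean + gain * (measurement - mean)
    var = (1.0 - gain) * var
    return mean, var


mean, var = 0.0, 1.0               # assumed Gaussian prior N(0, 1)
for z in [1.0, 1.0, 1.0]:          # made-up noisy measurements
    mean, var = kalman_step(mean, var, z, meas_var=1.0)
    print(f"measurement {z}: mean = {mean:.3f}, var = {var:.3f}")
```

Each step shrinks the variance and pulls the mean toward the data, which is the same prior-to-posterior loop that the more general algorithms approximate.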

I completely agree that there are basically infinitely more things to learn than available lifetime. It helps a lot to have a concrete application or goal in mind: then you can focus on learning the tools and theory that move you closer to the goal, rather than learning bits and pieces of unrelated knowledge that don't connect together in a useful way.

[1] https://tminka.github.io/papers/ep/roadmap.html

[2] Minka's EP slide deck from his PhD defense https://tminka.github.io/papers/ep/defense.pdf

[3] Rabiner wrote a famous HMM tutorial https://courses.physics.illinois.edu/ece417/fa2017/rabiner89...

I can almost guarantee you that if you're on Hacker News, you have the prerequisites needed for Judea Pearl.

I read Causality which I understand is the more technical of their books, and everything was presented surprisingly intuitively. Sure, I had to go over some things twice, but that's to be expected when you learn something new.

If you're worried, start with one of the more pop-aimed books? You'll be fine.

(Pearl did change the way I look at causality and correlation, fundamentally for the better, so I do strongly recommend getting familiar with it. I also liked Willful Ignorance, which is sort of on the same theme but also not, and takes a wider approach.)

I agree with this. Causality doesn't have much in the way of advanced math or statistics, but there is a fairly large volume of material to go through, which is its own challenge.

A lightweight introduction to Pearl's ideas is the epilog of his book, which is also his Turing award lecture. Here's a pdf scan, there's also video of him giving this lecture up on the internet if you prefer: http://bayes.cs.ucla.edu/BOOK-2K/causality2-epilogue.pdf

Here's an arbitrary review of Pearl's recent book, The Book of Why: https://www.ams.org/journals/notices/201907/rnoti-p1093.pdf
I took the blog to mean something a little different. You should pursue causality, but not everything is worth the work, so start with a faster, easier correlation analysis; then, if it seems worth it, you can test it to see if it behaves as expected.

Working in data, especially in a startup, we often need to make so many decisions. Trying to change the culture is good, but when it’s a fire, this approach would get us the furthest.

Starting with correlation and asserting causation is bad; starting with causation and using correlation as weak evidence is good.
Precisely (or to be really specific, starting with a suspicion of causation and using correlation as one piece of weak evidence).
I think the ideal is to start with something that you think is causal, then do the discussed correlation analysis. Then, if it looks fruitful, start testing by changing the distribution of the feature; if it is causal, you should see the KPI move in the same direction.

Does that make sense? Smaller steps to make sure you only invest if it is worth it.

> Personally, I'm not quite positive that I buy that causation is that hard to establish in many cases.

Do you mean it's not hard from an analytical point of view / from a practical data gathering perspective - or both?

I've found that needing causation leads to big delays in backlogs, especially when it's required for every insight, but I'm curious if you've seen it to be different.

I sort of feel like this: correlation is not causality, but they are correlated nonetheless.
Yes right - and we can actually put that information to some use in some practical way. At least that's the goal.
Kinds of reasoning - in order of accuracy (AID):

1. Abductive: https://en.wikipedia.org/wiki/Abductive_reasoning ex: Hypothesis

2. Inductive: https://en.wikipedia.org/wiki/Inductive_reasoning ex: Generalizations

3. Deductive: https://en.wikipedia.org/wiki/Deductive_reasoning ex: Studies

I think it means that there is a high chance that causation also means correlation. You see a lot of people dying after they refuse to vaccinate themselves. Can we be sure of the causation? Maybe not 100% but better safe than sorry. More often than not.
Isn't the problem here that things are backward?

You don't prove causation, but you can disprove it when you find absence of correlation.

Observed correlation suggests causation which allows you to make a prediction. A prediction can be tested. The prediction will either be true or false based upon whether the correlation continues to hold.

This is one of the problems with A/B tests--they often don't have causation aka "Why?" "This dialog box was rearranged and gave us 15% better conversion." Um. Okay. But "Why?" If you can't answer "Why?" you don't have causation.

"We removed needing to enter a phone number and now have 15% better conversion." "Why?" is obvious in that case.

I don’t think this post is trying to use correlation to prove causation. It’s in effect saying that when you can’t be sure that there is a causal relationship between two things that you can still make some decisions.

Perfect is the enemy of good as they say.

Yeah I love this! Well said
A/B tests can tell you causation. If they are correctly randomized and the tested change doesn't induce any other unintended changes, then they can tell you that the change causes whatever are the observed changes in your target metric. The challenge is controlling all other factors.

You're correct that they can't tell you the root cause of why your change causes a particular difference, but that's a separate issue from correlation and causation.

You can neither use correlation to prove causation nor can you use causation to prove correlation. X can cause Y but be uncorrelated to Y, and X can be correlated to Y without being caused by Y.
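
Both halves of that claim can be shown with toy numbers. The sketch below assumes two hand-built datasets (not real data): one where Y is fully determined by X yet the Pearson correlation is zero, and one where X and Y are strongly correlated only because a hidden variable Z drives both.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Case 1: X causes Y (Y = X^2), but the relationship is symmetric around zero,
# so the linear correlation vanishes.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_causal = pearson(xs, ys)

# Case 2: a hidden Z drives both X and Y; neither causes the other.
zs = list(range(10))
x_from_z = [2 * z + (z % 2) for z in zs]
y_from_z = [3 * z - (z % 3) for z in zs]
r_confounded = pearson(x_from_z, y_from_z)

print(f"causal but uncorrelated: r = {r_causal:.3f}")
print(f"correlated via common cause: r = {r_confounded:.3f}")
```

Pearson correlation only measures linear association, which is why the first case slips through; other dependence measures would pick it up.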
You can use causation to make a decision… and you can use correlation to make a decision. Each with different implications of course - but depending on the context both can be used to do the same task.
You can prove causation with an experimental study. You take a random population, split it in half and modify something for the first half, then look at the effects.
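
A rough simulation of why the random split matters, assuming a made-up population where a hidden trait raises the outcome and the treatment has a true effect of 2.0. All names and numbers here are hypothetical. Coin-flip assignment (roughly half in each group) recovers the effect, while letting high-trait units pick the treatment themselves inflates it.

```python
import random
from statistics import mean
from typing import Callable

TRUE_EFFECT = 2.0  # assumed ground-truth causal effect for this simulation


def outcome(trait: float, treated: bool, rng: random.Random) -> float:
    # The hidden trait influences the outcome regardless of treatment.
    return 3.0 * trait + (TRUE_EFFECT if treated else 0.0) + rng.gauss(0.0, 1.0)


def estimate_effect(assign: Callable[[float, random.Random], bool],
                    n: int = 10_000, seed: int = 0) -> float:
    """Difference in mean outcome between treated and control groups."""
    rng = random.Random(seed)
    treated_y, control_y = [], []
    for _ in range(n):
        trait = rng.gauss(0.0, 1.0)
        treated = assign(trait, rng)
        y = outcome(trait, treated, rng)
        (treated_y if treated else control_y).append(y)
    return mean(treated_y) - mean(control_y)


# Randomized: assignment ignores the trait entirely.
randomized = estimate_effect(lambda trait, rng: rng.random() < 0.5)
# Self-selected: units with a high trait opt in, confounding the comparison.
self_selected = estimate_effect(lambda trait, rng: trait > 0)

print(f"true effect: {TRUE_EFFECT}")
print(f"randomized estimate: {randomized:.2f}")
print(f"self-selected estimate: {self_selected:.2f}")
```

Randomization works because it makes the two groups alike on average in everything except the treatment, including traits nobody measured.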