Hacker News
by asdfasgasdgasdg 1763 days ago
Tl;Dr: never, but causality is hard to establish much of the time, so sometimes we must do without. To be honest, I don't find this very convincing. Most of the insights seem pretty obvious. For example, if you're starting by correlating totals across differently sized legs of an experiment, you're starting from a really bad place.
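
To make the point about totals concrete, here is a small sketch with made-up numbers (the leg sizes and conversion counts are purely illustrative and do not come from the blog post). It only shows why raw totals from differently sized legs can't be compared directly, while per-user rates can.

```python
def conversion_rate(conversions: int, users: int) -> float:
    """Per-user rate, which is comparable across legs of different sizes."""
    if users <= 0:
        raise ValueError("users must be positive")
    return conversions / users


# Hypothetical experiment legs (illustrative numbers only).
control = {"users": 10_000, "conversions": 500}
treatment = {"users": 1_000, "conversions": 80}

control_rate = conversion_rate(control["conversions"], control["users"])
treatment_rate = conversion_rate(treatment["conversions"], treatment["users"])

# Raw totals favor the bigger leg; per-user rates tell the opposite story.
totals_favor_control = control["conversions"] > treatment["conversions"]
rates_favor_treatment = treatment_rate > control_rate

print(f"control:   {control['conversions']} conversions, rate {control_rate:.1%}")
print(f"treatment: {treatment['conversions']} conversions, rate {treatment_rate:.1%}")
```

With these numbers the control leg "wins" on totals only because it is ten times larger.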

Personally, I'm not quite positive that I buy that causation is that hard to establish in many cases. Don't give up on that idea. One thing I would say is that if you have a strong prior reason to believe that one thing causes another, finding that they are strongly correlated can be a useful datum. Mainly, the important thing is to understand the limitations of correlation as a guide to decision-making.

5 comments

>More often than not, when stakeholders require "causality" to make a decision, it takes way too long so they lose patience and end up making a decision without any data at all.

And therein, I believe, lies the problem.

I think the issue is the pressure on science to constantly produce something, so in today's world correlation is treated as causality. Whether or not you believe in deterministic laws that govern reality, correlation is often the easiest approach when looking at a difficult problem, and therein lies the rise of many probabilistic and statistical models in the face of difficulty. Not all cases, but a lot of cases. We don't want to continue the hard work of determining definitive causal relations, if they exist, and are content with correlative relations.

As someone who grew up fascinated by science because it was science that sought and provided causal relations, I'm often disappointed by the current world of research. I'm not saying this work is easy by any means, it just seems like these days we often give up after we pick the low-hanging fruit.

I don't think it's necessarily that scientists are avoiding the hard work to show causality. It's that the most interesting causal experiments are often unethical, or the independent variable cannot be hidden like a placebo (so the participants' bias affects the randomization), or it's simply impossible.

I'll use one example from some data I've been looking at, which is whether the covid-19 pandemic has changed how people sleep. To study this using the formal notion of causality requires asking a random 50% of people to sleep as if covid isn't happening. That's obviously both impractical and implausible.

So you can really only look at correlations. But I can show you the correlations, and I bet you will be convinced that the pandemic HAS changed people's sleep. Here are some charts if you can take a look: https://jeffhuang.com/covid_sleep/ There are probably several factors that convince you that this is causal.

First, the pattern of sleep pre-covid is very stable, and feels trustworthy because it goes up and down during weekends, and holidays are visible. So the data is visibly sensitive to changes in the environment. Second, nearly every country reacts similarly when the N is separated by country, so even if there's some large group of people somewhere that are outliers (say, some policy by California that everyone needs to go to bed later), it would only affect the one country they are in, not each country separately in the same way. Finally, the patterns of sleep post-covid are also stable, with similar patterns as pre-covid, but just shifted.

I'm not sure if there are formal ways of representing these concepts, but I feel humans understand these intuitively.

Right - The point about human intuition is an important one. It's a key part of the approach that the blog talks about. When we start from intuition, we significantly improve our hypotheses about what could be correlated. And because no decision making is ever done in a vacuum, it's suggesting we lean into this and use it to our advantage instead of knocking correlations all the time.

Unfortunately this pressure - especially in research communities - has led to a lot of p-hacking.

https://www.pnas.org/content/117/24/13386

It's similar in a business context - the "pressure" of finding a causal result (especially in situations using A/B testing) leads to poor analysis practices in order to find something significant.
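
One way to see why hunting for "something significant" goes wrong is the arithmetic of multiple comparisons. The sketch below is a simplification under stated assumptions: every metric tested has no real effect, the tests are independent, and each uses the same threshold `alpha`. The function name is made up for illustration.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_tests independent tests comes out
    'significant' at level alpha when there is no real effect anywhere."""
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    return 1.0 - (1.0 - alpha) ** num_tests


# The more metrics or segments you slice an A/B test into, the more likely
# it is that something crosses the threshold by chance alone.
for k in (1, 5, 20):
    print(f"{k:>2} tests -> {chance_of_false_positive(k):.0%} chance of a spurious hit")
```

Under those assumptions, checking 20 unrelated metrics gives roughly a 64% chance of at least one "significant" result by chance alone.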

I think you'd find Judea Pearl's work interesting.

To be honest my understanding of math and statistics is far too basic for me to really approach this guy's work, in all likelihood. I have read some papers in this space, like the original TrueSkill paper[1], and I found them utterly impenetrable. I'm sure with sufficient practice I could learn but there are so many things to spend time on. I love the concepts and I do think that they are fascinating tools for modeling reality. I'm glad other people are developing them.

[1]: https://www.microsoft.com/en-us/research/wp-content/uploads/...

A pathway to understanding TrueSkill could be first getting an understanding of factor graphs and Minka's expectation propagation algorithm [1][2]. I only have a superficial understanding of those things, but I think I've got an understanding of things that are close enough to give some idea of a pathway:

To understand Minka's expectation propagation algorithm you might first need to get a little intuition about assumed density filtering. One way to understand assumed density filtering could be to read a few tutorials about hidden Markov models [3] or Kalman filters and try to get a feel for why and when and how people might want to approximate posterior probability distributions. It might be hard to build enough intuition without trying to actually apply the things (implement the algorithms) or prove the theory yourself, and then try to come up with your own ideas for how to improve the algorithms.
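
As a starting point for "actually applying the things", here is a minimal sketch of a one-dimensional Kalman filter for a random-walk state. It is not from the TrueSkill paper or Minka's work; the function names and the noise values are assumptions chosen for illustration. In this linear-Gaussian case the posterior really is Gaussian, so keeping only a mean and variance loses nothing. Assumed density filtering, as I understand it, keeps the same recursive shape but has to approximate when the exact posterior falls outside the assumed family.

```python
from typing import Iterable, List, Tuple


def predict(mean: float, var: float, process_var: float) -> Tuple[float, float]:
    """Random-walk prediction: the mean stays put, uncertainty grows."""
    return mean, var + process_var


def update(mean: float, var: float, obs: float, obs_var: float) -> Tuple[float, float]:
    """Combine a Gaussian prior with a noisy observation of the state."""
    gain = var / (var + obs_var)  # how much to trust the observation
    return mean + gain * (obs - mean), (1.0 - gain) * var


def run_filter(observations: Iterable[float], prior_mean: float = 0.0,
               prior_var: float = 1.0, process_var: float = 0.01,
               obs_var: float = 1.0) -> List[Tuple[float, float]]:
    """Return the posterior (mean, variance) after each observation."""
    mean, var = prior_mean, prior_var
    history = []
    for obs in observations:
        mean, var = predict(mean, var, process_var)
        mean, var = update(mean, var, obs, obs_var)
        history.append((mean, var))
    return history


for step, (m, v) in enumerate(run_filter([1.2, 0.9, 1.1, 1.0]), start=1):
    print(f"step {step}: mean={m:.3f} var={v:.3f}")
```

Watching the variance shrink as observations arrive gives some feel for what "tracking a posterior" means before moving on to the approximate versions.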

I completely agree that there are basically infinitely more things to learn than available lifetime. It helps a lot to have a concrete application or goal in mind: then you can focus on learning the tools and theory that move you closer to the goal, rather than learning bits and pieces of unrelated knowledge that don't connect together in a useful way.

[1] https://tminka.github.io/papers/ep/roadmap.html

[2] Minka's EP slide deck from his PhD defense https://tminka.github.io/papers/ep/defense.pdf

[3] Rabiner wrote a famous HMM tutorial https://courses.physics.illinois.edu/ece417/fa2017/rabiner89...

I can almost guarantee you that if you're on Hacker News, you have the prerequisites needed for Judea Pearl.

I read Causality which I understand is the more technical of their books, and everything was presented surprisingly intuitively. Sure, I had to go over some things twice, but that's to be expected when you learn something new.

If you're worried, start with one of the more pop-aimed books? You'll be fine.

(Pearl did change the way I look at causality and correlation, fundamentally for the better, so I do strongly recommend getting familiar with it. I also liked Willful Ignorance, which is sort of on the same theme but also not, and takes a wider approach.)

I agree with this. Causality doesn't have much in the way of advanced math or statistics, but there is a fairly large volume of material to go through, which is its own challenge.

A lightweight introduction to Pearl's ideas is the epilog of his book, which is also his Turing award lecture. Here's a pdf scan, there's also video of him giving this lecture up on the internet if you prefer: http://bayes.cs.ucla.edu/BOOK-2K/causality2-epilogue.pdf

Here's an arbitrary review of Pearl's recent book, The Book of Why: https://www.ams.org/journals/notices/201907/rnoti-p1093.pdf

I took the blog to mean something a little different. You should pursue causality, but not everything is worth the work, so start with a faster, easier correlation analysis; then if it seems worth it you can test it to see if it behaves as expected.

Working in data, especially in a startup, we often need to make so many decisions. Trying to change the culture is good, but when it's a fire this approach would get us the furthest.

Starting with correlation and asserting causation is bad; starting with causation and using correlation as weak evidence is good.

Precisely (or to be really specific, starting with a suspicion of causation and using correlation as one piece of weak evidence).

I think the ideal is to start with something that you think is causal, then do the discussed correlation analysis; then if it looks fruitful, start testing by changing the distribution of the feature, and if it is causal then you should see the KPI move in the same direction.

Does that make sense? Smaller steps to make sure you only invest if it is worth it
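
A toy simulation of why that last testing step matters, under made-up assumptions: a hidden factor `z` drives both the feature and the KPI, and the feature itself has no effect. All names and noise levels here are hypothetical. Observationally the two look strongly related, but once the feature's distribution is set by the experimenter the relationship disappears.

```python
import random
from math import sqrt
from typing import List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def feature_kpi_correlation(n: int, intervene: bool, seed: int = 0) -> float:
    """Correlation between feature and KPI in a toy world where a hidden
    factor z drives both and the feature has no causal effect on the KPI."""
    rng = random.Random(seed)
    feature, kpi = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                # hidden common cause
        if intervene:
            x = rng.gauss(0, 1)            # we set the feature, ignoring z
        else:
            x = z + rng.gauss(0, 0.5)      # feature tracks z in the wild
        y = z + rng.gauss(0, 0.5)          # KPI depends on z only
        feature.append(x)
        kpi.append(y)
    return pearson(feature, kpi)


print(f"observational: {feature_kpi_correlation(5000, intervene=False):.2f}")
print(f"intervention:  {feature_kpi_correlation(5000, intervene=True):.2f}")
```

If the feature really were causal in this toy world, the correlation would survive the intervention, which is the cheap check the step above is describing.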

> Personally, I'm not quite positive that I buy that causation is that hard to establish in many cases.

Do you mean it's not hard from an analytical point of view / from a practical data gathering perspective - or both?

I've found that needing causation leads to big delays in backlogs, especially when it's required for every insight, but I'm curious if you've seen it to be different.