by kgwgk 1918 days ago
> I've always learned of Simpson's paradox as relating more to different sample sizes when partitioning data

When you look at proportions based on binary outcomes, it may be related to imbalanced groups, but it's more general than that.

In the context discussed here, correlations between continuous variables, the groups can be of similar size.

See for example the chart here: https://towardsdatascience.com/simpsons-paradox-d2f4d8f08d42


Interesting, I only ever heard of Simpson's paradox in the context of comparing overall averages versus subgroup averages.

I guess this paradox could then be thought of as a special case of Simpson's paradox? Since the out group will exclude people with both traits there should also be a negative correlation there, which disappears in the overall population. But in Berkson's case it seems they're implying the subgroup correlation is spurious whereas with Simpson's it could go either way.

> Since the out group will exclude people with both traits there should also be a negative correlation there

Not necessarily. Imagine the traits are distributed uniformly and independently in [-1, 1]. There is no correlation:

    ******
    ******
    ******
    ******
    ******
    ******
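If you select people with at least one positive trait, you will find a negative correlation in the selected group (+), but the correlation will still be zero in the excluded group (-), because there both traits are still independent and uniform, just restricted to [-1, 0].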

    ++++++
    ++++++
    ++++++
    ---+++
    ---+++
    ---+++
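
For anyone who wants to check this numerically, here is a minimal simulation sketch of the picture above. The function names, sample size, and seed are my own choices for illustration, and it uses only the Python standard library. Working the integrals by hand for this setup, I get a correlation of -4/11 (about -0.36) in the + group, which the simulation should roughly reproduce.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n=200_000, seed=0):
    """Draw two independent uniform traits in [-1, 1] and split by selection.

    Group "plus": at least one trait is positive (the L-shaped region).
    Group "minus": both traits are non-positive (the lower-left square).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pts = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    plus = [p for p in pts if p[0] > 0 or p[1] > 0]
    minus = [p for p in pts if not (p[0] > 0 or p[1] > 0)]

    def corr(group):
        xs, ys = zip(*group)
        return pearson(xs, ys)

    return {"all": corr(pts), "plus": corr(plus), "minus": corr(minus)}


result = simulate()
for name, r in result.items():
    print(f"{name:>5}: {r:+.3f}")
```

The "all" and "minus" correlations should come out close to zero and the "plus" one clearly negative. Replacing the selection rule with a diagonal cut such as `p[0] + p[1] > 0` is an easy way to see how the shape of the boundary changes what happens in the excluded group.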
Makes sense. I was picturing more of a diagonal boundary, but you're right, the paradox doesn't specify the shape of the boundary. Thanks!