by caddemon
1918 days ago
I think Berkson's paradox is more specific than just correlations arising from non-random sample selection. Correlations that are not representative of the general population can still be useful if they are meaningful within some subgroup of interest. The problem arises when the features you are correlating relate too closely to the features used for sample selection; then you can end up with a trivial result. I've always understood Simpson's paradox as relating more to different sample sizes when partitioning data, which can happen entirely arbitrarily, for example when a baseball player gets injured partway through the season. The fact that one player's at-bats get partitioned differently than another's is not caused by on-field performance, so there's no "double dipping" going on like I would imagine with Berkson's. Conversely, I'm having trouble fitting a Berkson's example into the framework of Simpson's paradox, since there's no reason the poorly selected subpopulation can't theoretically be exactly half of the general population. And if all of the samples are of equal size, Simpson's paradox doesn't exist anymore (because with equal bin sizes the mean of means is equivalent to the overall mean).
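To make the bin-size argument concrete, here is a minimal sketch of the baseball case. The hit and at-bat counts are invented for illustration (they are not real player statistics), and the helper names `rates`, `pooled`, and `mean_of_rates` are hypothetical. It shows player A winning each half of the season yet losing on the pooled average when the at-bats are split unevenly, and the pooled average matching the mean of the per-half averages once the bins are equal.

```python
from fractions import Fraction

# Hypothetical (hits, at_bats) per half-season; all numbers are made up.
player_a = [(4, 10), (25, 100)]   # few at-bats in the first half
player_b = [(35, 100), (2, 10)]   # few at-bats in the second half


def rates(splits):
    """Batting average within each partition."""
    return [Fraction(hits, at_bats) for hits, at_bats in splits]


def pooled(splits):
    """Batting average over all at-bats combined."""
    total_hits = sum(hits for hits, _ in splits)
    total_at_bats = sum(at_bats for _, at_bats in splits)
    return Fraction(total_hits, total_at_bats)


def mean_of_rates(splits):
    """Unweighted mean of the per-partition averages."""
    per_split = rates(splits)
    return sum(per_split) / len(per_split)


# A is ahead in both halves...
a_wins_each_half = all(a > b for a, b in zip(rates(player_a), rates(player_b)))
# ...but behind once the unevenly sized halves are pooled.
b_wins_pooled = pooled(player_b) > pooled(player_a)

# Same per-half averages, but with 100 at-bats in every bin.
player_a_equal = [(40, 100), (25, 100)]
player_b_equal = [(35, 100), (20, 100)]
equal_bins_agree = pooled(player_a_equal) == mean_of_rates(player_a_equal)

print(a_wins_each_half, b_wins_pooled, equal_bins_agree)
```

This only covers averages of binary outcomes (hit or no hit), where the pooled rate is a weighted mean of the per-bin rates. It says nothing about correlations between continuous variables.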
|
When you look at proportions based on binary outcomes, Simpson's paradox may be related to imbalanced groups, but it's more general than that.
In the context discussed here, correlations between continuous variables, the groups can be of similar size.
See for example the chart here: https://towardsdatascience.com/simpsons-paradox-d2f4d8f08d42
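As a rough illustration of the continuous case, here is a small sketch with two groups of exactly equal size. The data points are invented, and `pearson` is a hand-rolled helper written for this example rather than a library function. Within each group `y` rises with `x`, but the pooled correlation comes out negative because the group with larger `x` values sits at a lower level of `y`.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Two made-up groups with five points each (equal sizes).
group1_x = [0, 1, 2, 3, 4]
group1_y = [x + 10 for x in group1_x]   # low x, high y, rising trend
group2_x = [6, 7, 8, 9, 10]
group2_y = [x - 6 for x in group2_x]    # high x, low y, rising trend

r_group1 = pearson(group1_x, group1_y)
r_group2 = pearson(group2_x, group2_y)
r_pooled = pearson(group1_x + group2_x, group1_y + group2_y)

# Both within-group correlations are positive; the pooled one is negative.
print(r_group1, r_group2, r_pooled)
```

Here the reversal comes from the offset between the groups, not from any imbalance in how many points each group contributes.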