| HN Mirror

by crystal_revenge 691 days ago

> And pulling granular data into a Python environment and fitting a regression is much less efficient than calculating aggregated statistics like mean and variance.

This is not true. You almost never need to perform logistic regression on individual observations. Consider that estimating a single Bernoulli rv on N observations is the same as estimating a single Binomial rv with k successes out of N trials. Most common statistical software (e.g. statsmodels) will support this grouped format.

If all of our covariates are discrete categories (which is typically the case for A/B tests), then you only need to regress on as many rows as there are unique configurations of the variables.

That is, if you're running an A/B test on 10 million users across 50 states and 2 variants, you only need 100 observations for your final model.

1 comments

e10v_me 691 days ago

> Most common statistical software (e.g. statsmodels) will support this grouped format.

Interesting, I didn't know this about statsmodels. But maybe the documentation is a bit misleading: "A nobs x k array where nobs is the number of observations and k is the number of regressors". Source: https://www.statsmodels.org/stable/generated/statsmodels.gen...

I would be grateful for references on how to use statsmodels to fit a logistic model using only aggregated statistics. Or not statsmodels. Any references will do.

apwheele 691 days ago

In statsmodels, for the methods I am familiar with, you can pass in frequency weights: https://www.statsmodels.org/stable/generated/statsmodels.gen...

So that will be a bit different from R-style formulas using cbind, but yes, if you only have a few categories of data, using weights makes sense. (Even many of sklearn's functions allow you to pass in weights.)

I have not worked out a closed form for logit regression, but for Poisson regression you can get a closed form for the incidence rate ratio: https://andrewpwheeler.com/2024/03/18/poisson-designs-and-mi.... So there is no need to use maximum likelihood at all in that scenario.
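As an illustration of the closed-form idea for the simplest two-group case (not necessarily the derivation in the linked post), the incidence rate ratio is just the ratio of the two observed rates, and a Wald interval on the log scale uses the standard error sqrt(1/events_t + 1/events_c). The function name and the example counts are hypothetical.

```python
import math

def incidence_rate_ratio(events_t, exposure_t, events_c, exposure_c):
    """Closed-form IRR with a 95% Wald interval for a two-group comparison."""
    rate_t = events_t / exposure_t
    rate_c = events_c / exposure_c
    irr = rate_t / rate_c

    # Standard error of log(IRR) depends only on the event counts.
    se_log_irr = math.sqrt(1 / events_t + 1 / events_c)
    z = 1.959963984540054  # 97.5th percentile of the standard normal

    lower = math.exp(math.log(irr) - z * se_log_irr)
    upper = math.exp(math.log(irr) + z * se_log_irr)
    return irr, lower, upper

# Hypothetical totals: events and exposure (e.g. users) per arm.
irr, lower, upper = incidence_rate_ratio(150, 1000, 120, 1000)
print(irr, lower, upper)
```

Only four aggregated numbers are needed, so no iterative fitting is involved.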

gatopingado 691 days ago

A logistic regression is the same as a Bernoulli regression, which is the single trial case of a Binomial regression [1].

[1] https://www.pymc.io/projects/examples/en/latest/generalized_...

e10v_me 691 days ago

Thank you, I'm aware of this. But I don't understand how your link answers my previous message. I was asking for an example of how to fit it using only aggregated statistics (focus on "aggregated"). I'm afraid MCMC and other Bayesian sampling algorithms are not the right examples.
