Hacker News
by deugtniet 691 days ago
I guess I'm not very versed in website A/B testing, but wouldn't it be much better to analyze these results in a regression framework where you can correct for the covariates?

On top of this, logistic regression makes your units a lot more interpretable than just looking at differences in means, e.g. the odds of buying something are 1.1 times higher when you are assigned to group B.
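
To make that interpretation concrete: in a logistic regression with only a group indicator, the exponentiated coefficient is the odds ratio between the two groups, and it can be read straight off the raw counts. A minimal sketch, assuming made-up conversion counts (the numbers and the `odds` helper are purely illustrative):

```python
def odds(conversions: int, visitors: int) -> float:
    """Odds of converting, p / (1 - p), computed from raw counts."""
    return conversions / (visitors - conversions)

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
odds_a = odds(conversions=1000, visitors=11000)  # 1000 converters vs 10000 non-converters
odds_b = odds(conversions=1100, visitors=11100)  # 1100 converters vs 10000 non-converters

# With a single binary covariate, exp(coefficient) in a logistic
# regression equals this ratio of odds.
odds_ratio = odds_b / odds_a
print(f"odds A = {odds_a:.3f}, odds B = {odds_b:.3f}, odds ratio = {odds_ratio:.2f}")
```

Note that an odds ratio of 1.1 is not the same as a 10% lift in conversion rate, although the two are close when conversion rates are low.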

This is the correct approach, but having done A/B testing for many years (and basically moved away from this area of work), I can say that nobody in the industry really cares about understanding the problem; they care about promoting themselves as experts and creating the illusion of rigorous marketing.

Correct A/B testing should involve starting with an A/A test to validate the setup, building a basic causal model of what you expect the treatment impact to be, controlling for covariates, and finally ensuring that when the causal factor is controlled for, the results change as expected.
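
The A/A step can be sketched as a simulation: split identical traffic into two arms many times and check that the share of "significant" results stays near the chosen alpha. This is only an illustration under assumptions of my own (a 10% conversion rate, a pooled two-proportion z-test, 500 repetitions); a real A/A test would run through the actual assignment and logging pipeline.

```python
import math
import random

def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value using the pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def aa_false_positive_rate(n_tests=500, users_per_arm=1000, rate=0.10,
                           alpha=0.05, seed=0):
    """Share of A/A tests flagged as significant; should be close to alpha."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        # Both arms draw from the same conversion rate: there is no true effect.
        conv_a = sum(rng.random() < rate for _ in range(users_per_arm))
        conv_b = sum(rng.random() < rate for _ in range(users_per_arm))
        if two_sided_p_value(conv_a, users_per_arm, conv_b, users_per_arm) < alpha:
            false_positives += 1
    return false_positives / n_tests

fpr = aa_false_positive_rate()
print(f"false positive rate in simulated A/A tests: {fpr:.3f}")
```

A rate far from alpha would point at a problem in the assignment, the logging, or the test itself.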

But even the "experts" I've read in this area largely focus on statistical details that honestly don't matter (and if they do, the change you're proposing is so small that you shouldn't be wasting time on it).

In practice, if you need "statistical significance" to determine whether a change has made an impact on your users, you're already focused on problems that are too small to be worth your time.

Ok so, that’s interesting. I like examples, so are you saying I should build a “framework” that presents two (landing) pages that are exactly the same and (hopefully) collects things like which source the visitor came from, and maybe some demographics? Then I try to get 100 impressions with randomly assigned blue and red buttons, check whether there is some confounding factor (say, blue was always picked by females coming from Google ads), and next time drop the randomization and show blue to half of the females from Google and half of everyone else?

I think the dumb underlying question I have is: how does one do experimental design?

Edit: and if you aren’t seeing giant obvious improvements, try improving something else? (I get the idea that my B is going to be so obvious that there is no need to worry about stats, and if it’s not, that’s a signal to change something else.)

There are tools for this that overlay your webpage and produce a heatmap showing where users' cursors have traveled. More popular areas show up "hotter" in red, which can indicate how effective your changes are, or where to place the content you want users to notice. I haven't worked with the data directly, but I have seen the Hotjar heatmaps on sites I've implemented (doing both frontend and backend development, but not involved in the design or SEO/marketing).

Thank you for the interest and for the suggestion.

Yes, one can analyze A/B tests in a regression framework. In fact, CUPED is equivalent to a linear regression with a single covariate.
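
A small sketch of that equivalence on simulated data. Everything here is an assumption for illustration: the data-generating process, the `pre`/`post` names, and estimating theta on the pooled sample. In this setup the two estimates agree very closely rather than digit for digit, because CUPED's theta is the pooled slope while OLS with an arm indicator uses the within-arm slope.

```python
import random
from statistics import fmean

def cross_dev(xs, ys):
    """Sum of products of deviations from the means (unnormalized covariance)."""
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

def simulate_arm(rng, n, effect):
    """Hypothetical data: `pre` is a pre-experiment metric, `post` the outcome."""
    pre = [rng.gauss(10, 2) for _ in range(n)]
    post = [x + effect + rng.gauss(0, 1) for x in pre]
    return pre, post

rng = random.Random(42)
pre_a, post_a = simulate_arm(rng, 5000, effect=0.0)
pre_b, post_b = simulate_arm(rng, 5000, effect=0.2)

raw_diff = fmean(post_b) - fmean(post_a)
pre_diff = fmean(pre_b) - fmean(pre_a)

# CUPED: theta = cov(pre, post) / var(pre), estimated ignoring the arm labels.
pre_all, post_all = pre_a + pre_b, post_a + post_b
theta = cross_dev(pre_all, post_all) / cross_dev(pre_all, pre_all)
cuped_effect = raw_diff - theta * pre_diff

# OLS of post ~ 1 + arm + pre: the slope on `pre` is the pooled within-arm
# slope, and the arm coefficient is the mean difference adjusted by it.
beta = (cross_dev(pre_a, post_a) + cross_dev(pre_b, post_b)) / (
    cross_dev(pre_a, pre_a) + cross_dev(pre_b, pre_b))
ols_effect = raw_diff - beta * pre_diff

print(f"raw: {raw_diff:.4f}  CUPED: {cuped_effect:.4f}  OLS: {ols_effect:.4f}")
```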

Would it be better? It depends on the definition of "better". There are several factors to consider. Scientific rigor is one of them. So is computational efficiency.

A/B tests are usually conducted at a scale of thousands of randomization units (actually it's more like tens or hundreds of thousands). There are two consequences:

1. Computational efficiency is very important, especially if we take into account the number of experiments and the number of metrics. And pulling granular data into a Python environment and fitting a regression is much less efficient than calculating aggregated statistics like mean and variance.

2. I didn't check, but I'm pretty sure that, at such scale, logistic and linear regressions' results will be very close, if not equal.
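
To illustrate the first point, here is a sketch of a comparison that needs only three numbers per variant (count, mean, variance), which a database can compute in a single aggregate query. The `Aggregates` type, the `compare` helper and the figures are assumptions for illustration, not how tea-tasting is implemented; the p-value relies on a normal approximation, which is reasonable at this scale.

```python
import math
from typing import NamedTuple

class Aggregates(NamedTuple):
    """Per-variant summary statistics, e.g. the output of a GROUP BY query."""
    n: int
    mean: float
    var: float  # sample variance of the metric

def compare(control: Aggregates, treatment: Aggregates):
    """Difference in means, its standard error, and a two-sided z-test p-value."""
    diff = treatment.mean - control.mean
    se = math.sqrt(control.var / control.n + treatment.var / treatment.n)
    p_value = math.erfc(abs(diff / se) / math.sqrt(2))
    return diff, se, p_value

# Hypothetical conversion metric: for a 0/1 outcome the variance is p * (1 - p).
control = Aggregates(n=100_000, mean=0.100, var=0.100 * 0.900)
treatment = Aggregates(n=100_000, mean=0.104, var=0.104 * 0.896)

diff, se, p_value = compare(control, treatment)
print(f"diff = {diff:.4f}, se = {se:.5f}, p = {p_value:.4f}")
```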

And even if, for some reason, there is a real need to analyze a test using a logistic model, a multi-level model, or clustered errors, it's possible in tea-tasting via custom metrics: https://tea-tasting.e10v.me/custom-metrics/

> And pulling granular data into a Python environment and fitting a regression is much less efficient than calculating aggregated statistics like mean and variance.

This is not true. You almost never need to perform logistic regression on individual observations. Consider that estimating a single Bernoulli rv on N observations is the same as estimating a single Binomial rv with k successes out of N trials. Most common statistical software (e.g. statsmodels) will support this grouped format.

If all of our covariates are discrete categories (which is typically the case for A/B tests), then you only need to run the regression on a number of rows equal to the number of unique configurations of the variables.

That is, if you're running an A/B test on 10 million users across 50 states and 2 variants, you only need 100 observations for your final model.
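
A from-scratch sketch of why this works, using only the standard library. It fits the same logistic model twice with a plain Newton-Raphson loop: once on individual 0/1 rows and once on the rows collapsed to one (successes, trials) pair per unique covariate combination. The simulated data (2 variants, 3 regions, 6000 users) and all function names are assumptions for illustration.

```python
import math
import random

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_binomial_logit(rows, n_iter=15):
    """rows: (features, successes, trials). Newton-Raphson on the log-likelihood."""
    k = len(rows[0][0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, successes, trials in rows:
            p = 1 / (1 + math.exp(-sum(b * xi for b, xi in zip(beta, x))))
            for i in range(k):
                grad[i] += x[i] * (successes - trials * p)
                for j in range(k):
                    hess[i][j] += x[i] * x[j] * trials * p * (1 - p)
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

# Simulated individual-level data: intercept, variant dummy, two region dummies.
rng = random.Random(0)
individual = []
for _ in range(6000):
    variant, region = rng.randrange(2), rng.randrange(3)
    x = (1.0, float(variant), float(region == 1), float(region == 2))
    p = 1 / (1 + math.exp(-(-2.0 + 0.3 * variant + 0.2 * region)))
    individual.append((x, int(rng.random() < p), 1))

# Collapse to one row per unique covariate combination.
counts = {}
for x, y, _ in individual:
    s, t = counts.get(x, (0, 0))
    counts[x] = (s + y, t + 1)
grouped = [(x, s, t) for x, (s, t) in counts.items()]

beta_individual = fit_binomial_logit(individual)
beta_grouped = fit_binomial_logit(grouped)
print(len(individual), "rows vs", len(grouped), "rows")
print([round(b, 4) for b in beta_individual])
print([round(b, 4) for b in beta_grouped])
```

The gradient and Hessian are sums over rows, and rows with identical covariates contribute identical terms, so collapsing them changes nothing except the amount of work.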

> Most common statistical software (e.g. statsmodels) will support this grouped format.

Interesting, I didn't know this about statsmodels. But maybe the documentation is a bit misleading: "A nobs x k array where nobs is the number of observations and k is the number of regressors". Source: https://www.statsmodels.org/stable/generated/statsmodels.gen...

I would be grateful for references on how to use statsmodels to fit a logistic model using only aggregated statistics. It doesn't have to be statsmodels; any references will do.

For the statsmodels methods I am familiar with, you can pass in frequency weights: https://www.statsmodels.org/stable/generated/statsmodels.gen...

So that will be a bit different from R-style formulas using cbind, but yes, if you only have a few categories of data, using weights makes sense. (Even many of sklearn's functions allow you to pass in weights.)

I have not worked out a closed form for logit regression, but for Poisson regression you can get a closed form for the incidence rate ratio, https://andrewpwheeler.com/2024/03/18/poisson-designs-and-mi.... So no need to use maximum likelihood at all in that scenario.
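
For the simplest case, a Poisson model with one binary covariate and an exposure offset, the estimate is just a ratio of observed rates. The sketch below is my own illustration of that textbook result with made-up counts, not code from the linked post; the interval uses the usual large-sample standard error of the log rate ratio.

```python
import math

def incidence_rate_ratio(events_b, exposure_b, events_a, exposure_a, z=1.96):
    """Rate ratio B vs A with an approximate 95% interval on the log scale."""
    irr = (events_b / exposure_b) / (events_a / exposure_a)
    se_log = math.sqrt(1 / events_a + 1 / events_b)
    low = math.exp(math.log(irr) - z * se_log)
    high = math.exp(math.log(irr) + z * se_log)
    return irr, low, high

# Hypothetical counts: events per 10000 units of exposure in each arm.
irr, low, high = incidence_rate_ratio(
    events_b=1100, exposure_b=10_000, events_a=1000, exposure_a=10_000)
print(f"IRR = {irr:.3f}, approx. 95% CI = ({low:.3f}, {high:.3f})")
```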

A logistic regression is the same as a Bernoulli regression, which is the single trial case of a Binomial regression [1].

[1] https://www.pymc.io/projects/examples/en/latest/generalized_...
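
A quick numerical check of that statement, as a sketch with made-up outcomes: the binomial log-likelihood of k successes in n trials differs from the sum of the n Bernoulli log-likelihoods only by log C(n, k), which does not depend on p. Maximizing either one therefore gives the same estimate.

```python
import math

def bernoulli_loglik(outcomes, p):
    """Sum of per-observation Bernoulli log-likelihoods."""
    return sum(math.log(p) if y else math.log(1 - p) for y in outcomes)

def binomial_loglik(k, n, p):
    """Log-likelihood of k successes in n trials."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1 - p)

outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # hypothetical 0/1 conversions
k, n = sum(outcomes), len(outcomes)

# The gap between the two log-likelihoods is the same for every p.
gaps = [binomial_loglik(k, n, p) - bernoulli_loglik(outcomes, p)
        for p in (0.1, 0.3, 0.5, 0.7)]
constant = math.log(math.comb(n, k))
print([round(g, 6) for g in gaps], round(constant, 6))
```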

Thank you, I'm aware of this. But I don't understand how your link answers my previous message. I was asking for an example of how to fit it using only aggregated statistics (focus on "aggregated"). I'm afraid MCMC or other Bayesian sampling algorithms are not the right examples.