
> since you know those groups are seeing the same thing and should be performing identically

That's not how A/B testing works. Testing at 95% confidence means accepting a 5% false positive rate, i.e., you should expect the difference measured in an A/A test to come out statistically significant 5% of the time. You'll always measure some difference, since no two random samples will be 100% identical in every regard.
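To make the 5% figure concrete, here is a minimal simulation sketch. It assumes a pooled two-proportion z-test and made-up numbers (500 users per bucket, a 10% conversion rate); the function names are hypothetical and not from any particular A/B testing tool. Both buckets are drawn from the same distribution, so every "significant" result is a false positive.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))


def aa_false_positive_rate(n_tests=2000, n_users=500, rate=0.10,
                           alpha=0.05, seed=0):
    """Fraction of simulated A/A tests that come out 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        # Both buckets share the same true conversion rate (assumed value).
        a1 = sum(rng.random() < rate for _ in range(n_users))
        a2 = sum(rng.random() < rate for _ in range(n_users))
        if two_proportion_p_value(a1, n_users, a2, n_users) < alpha:
            hits += 1
    return hits / n_tests


if __name__ == "__main__":
    print(aa_false_positive_rate())
```

The printed fraction should land close to 0.05, even though the two buckets are identical by construction.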

The procedure you and the parent propose is tantamount to picking roughly 1 out of every 20 test results at random and throwing it out for no real reason. It adds extra cost to your A/B testing without producing more reliable results.

See also: https://xkcd.com/882/

It's a different matter if you're running multiple A/A-type tests over an extended period of time to ensure that the false positive rate is actually 5%, a kind of meta-statistical test. As a sanity check this is sound, but it is vastly more expensive than what the OP is proposing. I've never seen anyone use A/A, A/A/B, A/A/B/B, etc. tests in this way. Rather, I've only ever seen them used as you and the OP suggest: the two A buckets should be "the same" and if they aren't, the results should be thrown out.
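For what the meta-statistical check might look like, here is a rough sketch. It assumes you have logged how many of your A/A tests came out significant and uses an exact two-sided binomial test against the nominal 5%. The function names and the counts in the usage line are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def false_positive_rate_p_value(significant, total, nominal=0.05):
    """Exact two-sided binomial p-value for 'the true rate equals nominal'.

    Sums the probability of every outcome that is no more likely than the
    observed count of significant A/A tests.
    """
    observed = binom_pmf(significant, total, nominal)
    tail = sum(
        pm
        for i in range(total + 1)
        if (pm := binom_pmf(i, total, nominal)) <= observed * (1 + 1e-9)
    )
    return min(1.0, tail)


if __name__ == "__main__":
    # Hypothetical log: 200 A/A tests, 25 of them flagged as significant.
    print(false_positive_rate_p_value(25, 200))
```

A small p-value here would suggest the pipeline's false positive rate is not really 5% (bad randomization, peeking, a broken metric). Note how many A/A tests it takes before this check can say anything, which is the cost problem mentioned above.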