by SyneRyder
3906 days ago
This is great advice. One of the best things about doing AABB testing is when your two A groups & B groups don't converge, you can identify bugs in your testing procedure or measure the margin of error (since you know those groups are seeing the same thing and should be performing identically). Seeing two identical A groups with wildly different results will make you more skeptical of generic A/B results & make you more rigorous about your testing.
That's not how A/B testing works. 95% confidence means you should expect a 5% false positive rate, i.e., you should expect the difference measured in an A/A test to be statistically significant 5% of the time. You'll always measure some difference, since no two random samples will be 100% identical in every regard.
The procedure you and the parent propose is tantamount to selecting 1 out of every 20 test results and discounting it for no real reason. It adds extra cost to your A/B testing without producing more reliable results.
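To make the 5% figure concrete, here is a rough simulation sketch, not anything from the thread itself. It assumes a pooled two-proportion z-test, a 10% baseline conversion rate, and 1,000 visitors per bucket; the function names and all of those numbers are made up for illustration. Both buckets get the identical "page", and the script counts how often the difference still comes out significant at alpha = 0.05.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both buckets converted 0% or 100%: no evidence of a difference.
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_false_positive_rate(trials=2000, n=1000, rate=0.10,
                                    alpha=0.05, seed=0):
    """Run many A/A tests and return the fraction flagged as significant.

    All parameters are illustrative assumptions, not values from the thread.
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        # Both buckets draw from the same true conversion rate.
        conv_a1 = sum(rng.random() < rate for _ in range(n))
        conv_a2 = sum(rng.random() < rate for _ in range(n))
        if two_proportion_p_value(conv_a1, n, conv_a2, n) < alpha:
            false_positives += 1
    return false_positives / trials


if __name__ == "__main__":
    print(f"A/A tests flagged significant: {simulate_aa_false_positive_rate():.1%}")
```

Under these assumptions the printed fraction should land near 5%, which is the point: an A/A pair that "disagrees" about one time in twenty is the test behaving as designed, not evidence of a broken setup.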
See also: https://xkcd.com/882/
It's a different matter if you're running multiple A/A-type tests over an extended period of time to ensure that the false positive rate is actually 5%, a kind of meta-statistical test. As a sanity check this is sound, but vastly more expensive than what the OP is proposing. I've never seen anyone use A/A, A/A/B, A/A/B/B, etc. tests in this way. Rather, I've only ever seen them used as you and the OP suggest: the two A buckets should be "the same" and if they aren't, the results should be thrown out.
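For what it's worth, the meta-statistical check described above could look something like the sketch below. It assumes you have already run some number of independent A/A tests and simply counted how many came out significant; it then asks, with an exact two-sided binomial test, whether that count is consistent with a 5% false positive rate. The function name and the example counts are hypothetical.

```python
import math


def binomial_two_sided_p_value(k, n, p=0.05):
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than the
    observed count k, under the hypothesis that each of the n A/A tests
    comes out significant with probability p.
    """
    def pmf(i):
        return math.comb(n, i) * p**i * (1 - p) ** (n - i)

    observed = pmf(k)
    # Small relative tolerance so float noise does not drop ties.
    total = sum(pmf(i) for i in range(n + 1) if pmf(i) <= observed * (1 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical counts: 100 A/A tests run, k of them flagged significant.
    for k in (5, 15):
        p_value = binomial_two_sided_p_value(k, 100)
        print(f"{k} of 100 significant: p = {p_value:.4f}")
```

With 5 of 100 significant there is nothing to explain, while 15 of 100 would suggest the pipeline really is producing too many false positives. Note how many A/A runs it takes before this check has any power, which is the expense mentioned above.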