Hacker News
by sdrinf 1294 days ago
Not GP, but there's a significant kink when this applies to humans: namely, that humans have the ability to reflect on publicly known outcomes, and change their behavior en masse in light of the information so gained.

I put this earlier under the phrase "reflection completeness": https://sdrinf.com/reflection-completeness i.e. there are things which stop working once people know about them.

In particular with A/B testing, this means that the initial A/B test result is a mix of at least 3 effects. At its core, it measures how the naive population's behavior changes as a function of new functionality being made available. This is heavily, heavily time-dependent: there's a "novelty effect" (early data collection will not be representative of long-term usage patterns), and there's a "reflection effect" (once the outcome of the test is widely known, people can change their behavior based on that). Controlling for the novelty effect is difficult, but possible; controlling for the reflection effect, beyond just "keeping everything secret", is significantly harder, as the timelines for that might be years in length.
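
To make the novelty effect concrete, here's a rough sketch of how a 2-week test window can overstate the long-run lift. Everything in it is an assumption for illustration: the exponential decay shape, the 1% long-term lift, the 8% initial novelty boost, and the 7-day half-life are made-up numbers, not measurements from anywhere.

```python
def daily_lift(day, long_term_lift=0.01, novelty_boost=0.08, half_life_days=7.0):
    """Hypothetical relative lift of variant B over A on a given day.

    Assumes the novelty boost decays exponentially toward the long-term lift.
    All default values are illustrative, not empirical.
    """
    decay = 0.5 ** (day / half_life_days)
    return long_term_lift + novelty_boost * decay


def mean_lift(start_day, end_day):
    """Average daily lift over the half-open window [start_day, end_day)."""
    days = range(start_day, end_day)
    return sum(daily_lift(d) for d in days) / len(days)


early = mean_lift(0, 14)   # what a typical 2-week test window sees
late = mean_lift(76, 90)   # a window of the same length, months later

print(f"first 2 weeks: {early:.3f}")
print(f"days 76-90:    {late:.3f}")
```

Under these assumptions the early window reports a lift several times larger than the one that persists. The reflection effect isn't modeled here at all; it would need a second term that kicks in once the outcome becomes public.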

I strongly suspect GP was pointing at this timeline factor, and specifically that market engineering, as currently and widely practiced, is grounded on the immediately available signal of "does it increase sales within 2 weeks of the A/B test running". Which, given novelty effects, is heavily biased towards "yes"; and these people aren't incentivized (nor do they have the time/energy) to measure _very_ long-term effects beyond the novelty and reflection periods.

3 comments

I agree that it can be a difficult thing to analyze. There's also the Hawthorne effect at play here. But those are just confounding variables; they do not negate the fact that A/B tests are still "real science".

An A/B test just refers to observing how a dependent variable changes when an independent variable is in two different states, State A and State B.

Drug vs. placebo is an A/B test.

Most companies (or at least the ones doing things properly) will also run a long-running retro test to see if the impact persists (new test group = users who don't get the new changes).
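
A minimal sketch of what such a retro (holdout) check could look like: compare the conversion rate of users who have the change against a small group that never got it, month by month, and see whether the lift holds. The function name, the 95/5 split, and the monthly counts are all made up for illustration.

```python
def relative_lift(treated_conv, treated_n, holdout_conv, holdout_n):
    """Relative lift of the treated group's conversion rate over the holdout's."""
    treated_rate = treated_conv / treated_n
    holdout_rate = holdout_conv / holdout_n
    return treated_rate / holdout_rate - 1.0


# Hypothetical monthly counts after launch:
# (treated conversions, treated users, holdout conversions, holdout users)
monthly = [
    (5400, 95000, 250, 5000),
    (5000, 95000, 250, 5000),
    (4800, 95000, 250, 5000),
]

lifts = [relative_lift(*row) for row in monthly]

for month, lift in enumerate(lifts, start=1):
    print(f"month {month}: lift vs holdout = {lift:+.1%}")
```

In this invented data the lift shrinks every month, which is the pattern a short test alone would miss. A real analysis would also want confidence intervals, since a 5% holdout is noisy.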
I feel like it's especially bad for any UI changes that relate to long-term productivity; that means measuring how a given change affects existing users, and whether performance goes back to the previous level or drops below it after a few weeks or months.