by portman 3134 days ago
Isn't this particular test susceptible to confounding effects? Traffic fluctuates day-to-day, week-to-week, and month-to-month, so how can you be sure it was the presence or absence of ads and not something else? If you randomize at the visitor level, you sample from both high- and low-traffic days and control for any external fluctuations.

Because you're randomizing at the 2-day level, on average there will be just as many advertising/high-traffic days as advertising/low-traffic days, and as many no-advertising/high-traffic days as no-advertising/low-traffic days. The randomization is unaffected by traffic and uncorrelated with it. The unit of analysis is each day, not each visitor. This is why it has to run for several months; otherwise you don't wind up with a decent n=50 pairs.

That's the tradeoff here: it lets you look at the totals, but it takes a lot longer than randomizing per visitor, in which case you could often finish the test in a few days.
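
To make the block randomization concrete, here is a rough sketch, not the actual test code. It assumes a coin flip per 2-day block and uses made-up traffic numbers with a weekly cycle and no ad effect at all; `assign_blocks` and `simulated_traffic` are hypothetical helper names.

```python
import random

def assign_blocks(n_days, block_len=2, seed=0):
    """Return one True/False (ads on/off) per day, constant within each block."""
    rng = random.Random(seed)
    assignment = []
    while len(assignment) < n_days:
        ads_on = rng.random() < 0.5  # coin flip, made without looking at traffic
        assignment.extend([ads_on] * block_len)
    return assignment[:n_days]

def simulated_traffic(n_days, seed=1):
    """Hypothetical daily visits: weekly cycle plus noise (made-up numbers)."""
    rng = random.Random(seed)
    return [1000 + 300 * (day % 7 < 5) + rng.gauss(0, 100) for day in range(n_days)]

def mean(xs):
    return sum(xs) / len(xs)

n_days = 100  # 50 two-day blocks
ads = assign_blocks(n_days)
visits = simulated_traffic(n_days)

# Split the simulated days by assignment; the traffic has no ad effect built in.
on = [v for a, v in zip(ads, visits) if a]
off = [v for a, v in zip(ads, visits) if not a]
print(f"ad days: {len(on)}, mean visits {mean(on):.0f}")
print(f"no-ad days: {len(off)}, mean visits {mean(off):.0f}")
```

Since the coin flips never see the traffic, any gap between the two means in this simulation is chance alone, and it should shrink as you add more blocks.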

The stats went well over my head, but as a web analyst I thought maybe you could have asked a simpler question. Pick a segment of your site, such as visits from Google Search that landed on your homepage (most likely people who searched "gwern"), which should reduce a lot of those spikes.
Subsetting will also increase the variance of each datapoint (consider the extreme case of picking a subset which had 0 or 1 visits per day), so it is probably not a win. It's also hard to imagine what subset properly reflects all sources of traffic and so is informative about the total effect of advertising. Search queries are definitely not it.
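
A back-of-the-envelope way to see the variance point, assuming daily visits behave roughly like a Poisson count (real traffic is usually noisier than that, so treat this as a floor): the noise relative to the mean is 1/sqrt(mean), so it grows as the subset shrinks. `relative_sd` is a hypothetical helper.

```python
import math

def relative_sd(mean_visits_per_day):
    """Coefficient of variation of a Poisson count: sd / mean = 1 / sqrt(mean)."""
    return math.sqrt(mean_visits_per_day) / mean_visits_per_day

# Smaller subsets mean fewer visits per day and a noisier daily datapoint.
for mean_visits in (10000, 1000, 100, 1):
    print(f"{mean_visits:>6} visits/day -> relative sd {relative_sd(mean_visits):.1%}")
```

Under that assumption the relative noise goes from 1% at 10,000 visits/day to 100% at 1 visit/day.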