by abeppu 5183 days ago
The "test everything" mantra sounds good, but in practice you generally have only so much data you can afford (in impressions per day, or whatever), and when your CTRs are often 0.1% or lower, you need quite a lot of data to get narrow confidence intervals around your CTRs. Using the basic binomial model, if you have two test conditions, one of which actually does 20-25% better than the other (say, 0.11% versus 0.09%), your confidence intervals will keep overlapping until you have on the order of 1M impressions. This is all just to say that running a whole lot of tests can quickly become expensive and impractical.
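
A rough sketch of where that order of magnitude comes from. It assumes 95% normal-approximation (Wald) intervals and that each arm's observed CTR equals its true CTR; the function names are made up for illustration.

```python
import math

def wald_interval(p, n, z=1.96):
    """Normal-approximation interval for a rate p observed over n impressions."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

def impressions_until_separated(p_low, p_high, z=1.96):
    """Smallest per-arm n at which the two intervals stop overlapping.

    Assumption: the observed rates sit exactly at the true rates.
    """
    spread = math.sqrt(p_low * (1 - p_low)) + math.sqrt(p_high * (1 - p_high))
    return math.ceil((z * spread / (p_high - p_low)) ** 2)

n_per_arm = impressions_until_separated(0.0009, 0.0011)
print(n_per_arm, 2 * n_per_arm)  # per arm, and total across both arms
```

Under these assumptions the answer is a few hundred thousand impressions per arm, so a total in the neighbourhood of 1M.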

While testing some radical, weird treatments can give you valuable perspective, or shed light on the assumptions you've been making, testing every idea is rarely feasible. I would not, for instance, guess that the author should test different versions of the second ad with colors or number of exclamation points changed.

3 comments

As someone who has spent $100,000 on advertising over the last few years, at least $10,000 of which was on the Plenty of Fish platform, you're very wrong.

You only need about 1,000-10,000 impressions to get an idea of how a creative performs. Often less. As you get more and more used to each particular advertising platform, you also get a feel for how an ad is performing.

In my business, a difference of 0.02% CTR could mean the difference between earning 30% ROI and 50% ROI - the words "test everything!" mean everything to me and my results.

Yes, test everything. But what exactly do you mean by "get a feel"?

Suppose your CTR is known to be either 0.09% or 0.11%, you've had 10,000 impressions, and you've got 11 clickthroughs. (This is of course the most likely number if your CTR is actually 0.11%.) The likelihood ratio between the two possibilities is about 0.81. So if you thought those two possibilities were equally probable before, you should now think it's about 55% likely that the CTR is 0.11% and about 45% likely that it's 0.09%.
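
The arithmetic above can be checked in a few lines. This is a minimal sketch using the exact binomial likelihood and an even prior over the two candidate CTRs; the helper name is illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k clicks in n impressions at click rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, k = 10_000, 11
like_low = binom_pmf(k, n, 0.0009)   # likelihood if the CTR is 0.09%
like_high = binom_pmf(k, n, 0.0011)  # likelihood if the CTR is 0.11%

ratio = like_low / like_high
# With equal priors, the posterior is just the normalised likelihood.
posterior_high = like_high / (like_low + like_high)

print(round(ratio, 2), round(posterior_high, 2))  # 0.81 0.55
```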

So if by "get an idea" you mean something much stronger than 55%:45% then I fear you may be fooling yourself, no matter how much you've spent on advertising over the last few years. (Whether that means you should reconsider "test everything" depends on the costs of testing -- the actual cost of doing it, and the cost in running something other than the currently-believed-best version.)

And with 1000 impressions? Forget it. You expect to see 0.9 clickthroughs on average with a 0.09% CTR and 1.1 on average with a 0.11% CTR. You can probably get some extra information from (e.g.) when that expected single clickthrough happens, but it's not going to take you near to that 55:45 ratio. (I might believe 52:48.)
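
A sketch of the same update at 1,000 impressions, looping over the click counts you are likely to see. It again assumes only the two candidate CTRs and an even prior, and it uses the click count alone (not the timing of clicks).

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k clicks in n impressions at click rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def posterior_high(k, n, p_low=0.0009, p_high=0.0011, prior_high=0.5):
    """Posterior probability of the higher CTR after k clicks in n impressions."""
    w_low = (1 - prior_high) * binom_pmf(k, n, p_low)
    w_high = prior_high * binom_pmf(k, n, p_high)
    return w_high / (w_low + w_high)

for clicks in range(4):
    print(clicks, round(posterior_high(clicks, 1000), 3))
```

A single click leaves the odds essentially at 50:50, and zero or two clicks move them by only about five points in either direction.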

Both sides are right here.

First of all I don't care how much you've spent on advertising, or how much experience you have. Human brains are hardwired to see patterns that don't really exist. If you haven't done the math then I guarantee that your intuition for what matters is way wrong.

However going the other way, people take standards from the science world into A/B testing that are not appropriate. If you're testing a ton of ideas, getting the right answer 3x out of 4, and not going far wrong most of the rest of the time is a pretty good result. It is certainly a lot better than concluding that it is too hard to test those ideas at all.
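
To put a rough number on "right 3x out of 4": the sketch below estimates how many impressions per arm you need before the truly better ad shows the higher observed CTR with a given probability. It assumes the 0.09% versus 0.11% example from upthread, a normal approximation to the difference of two binomial rates (crude at click counts this small), and it ignores ties. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def impressions_to_pick_winner(p_low, p_high, prob_correct):
    """Approximate per-arm impressions so the better arm wins with prob_correct."""
    z = NormalDist().inv_cdf(prob_correct)
    variance_sum = p_low * (1 - p_low) + p_high * (1 - p_high)
    return math.ceil(z * z * variance_sum / (p_high - p_low) ** 2)

n_75 = impressions_to_pick_winner(0.0009, 0.0011, 0.75)
n_95 = impressions_to_pick_winner(0.0009, 0.0011, 0.95)
print(n_75, n_95)
```

Under these assumptions the 75% target needs a little over 20,000 impressions per arm, several times fewer than the 95% target.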

But if you have a specific idea you want tested, or if you want all ideas tested to a certain confidence, then you really, really need to either do the math, or to get someone to do the math for you. Because the human brain is a pattern finding machine, and you want real answers, not made up stuff.

I think part of the point abeppu is making is whether or not the statistics for the two advertisement types are actually different. You see a difference of 0.02% CTR, but is that significant?

An analogy (from an interesting post I can't find) is tossing a coin in two sample sets of 1000 tosses each. In one sample I wear a red jumper and in the other I wear a green jumper. I find that the green jumper gives me a 0.02% improvement in producing heads. Therefore I will always wear my green jumper in future when I toss a coin.

Obviously, however, this is just random error. A statistical analysis of CTR will tell you if the difference between the two advertisements you are observing is significant. The larger the sample size, as abeppu wrote, the greater the confidence in your results.
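
A small simulation of the jumper analogy, as a sketch: both "jumpers" use the same fair coin, yet the two head counts routinely differ. The gap threshold of 20 heads (2 percentage points), the trial count and the seed are arbitrary choices for illustration.

```python
import random

def heads_in(tosses, rng):
    """Count heads in `tosses` fair flips, using one random bit per flip."""
    return bin(rng.getrandbits(tosses)).count("1")

def share_of_large_gaps(tosses=1000, gap=20, trials=2000, seed=0):
    """Fraction of trials where two identical fair coins differ by >= gap heads."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        red = heads_in(tosses, rng)    # "red jumper" sample
        green = heads_in(tosses, rng)  # "green jumper" sample
        if abs(red - green) >= gap:
            hits += 1
    return hits / trials

print(share_of_large_gaps())
```

A substantial share of runs show a gap of that size purely by chance, which is why an observed difference needs a significance test before it means anything.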

I agree with your point; however, I would like to highlight your use of confidence intervals:

"If two statistics have non-overlapping confidence intervals, they are necessarily significantly different but if they have overlapping confidence intervals, it is not necessarily true that they are not significantly different."

http://www.cscu.cornell.edu/news/statnews/stnews73.pdf
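
A numeric illustration of the quoted point, as a sketch. The click counts (270 and 330 clicks on 300,000 impressions each) are hypothetical, chosen to match the 0.09% and 0.11% CTRs discussed above; the intervals are 95% Wald intervals and the test is a pooled two-proportion z-test.

```python
import math

def wald_interval(clicks, n, z=1.96):
    """95% normal-approximation interval for an observed click rate."""
    p = clicks / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

def two_proportion_p_value(clicks_a, n_a, clicks_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z_stat = (clicks_b / n_b - clicks_a / n_a) / se
    return math.erfc(abs(z_stat) / math.sqrt(2))

a_low, a_high = wald_interval(270, 300_000)
b_low, b_high = wald_interval(330, 300_000)
p_value = two_proportion_p_value(270, 300_000, 330, 300_000)

print(a_high > b_low)  # True: the two intervals overlap
print(p_value < 0.05)  # True: the difference is still significant at 5%
```

So waiting for non-overlapping intervals is a stricter criterion than a direct test of the difference.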

In particular, it doesn't scale well at all with the number of choices (the familiar curse of dimensionality). You can reasonably test two variants, but if your design could plausibly vary along, say, 10 axes (not uncommon), you're going to have trouble collecting sufficient data to cover the whole 10-dimensional theoretical design space. So data-driven design can usually only be applied to a small part of the design space, typically testing a small number of alternatives.
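
A back-of-the-envelope sketch of that blow-up. It assumes a full-factorial test with only two options per axis and 10,000 impressions per combination (the upper figure quoted earlier in the thread); both numbers are illustrative assumptions, not recommendations.

```python
def full_factorial_cost(axes, variants_per_axis, impressions_per_cell):
    """Number of design combinations and total impressions to cover them all."""
    cells = variants_per_axis ** axes
    return cells, cells * impressions_per_cell

cells, total = full_factorial_cost(axes=10, variants_per_axis=2,
                                   impressions_per_cell=10_000)
print(cells, total)  # 1024 10240000
```

Even at that per-combination budget, covering all ten binary axes takes over ten million impressions, and each extra axis doubles it.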