Hacker News
Most AB-tests will fail (jitbit.com)
82 points by m4xt3r 4742 days ago
17 comments

#1 It's good to see a post using statistically significant sample sizes. Most articles on this topic that I have read either include sample sizes so embarrassingly tiny (30 to 300) that you have to wonder whether the author even understood his statistics class, or don't include numbers at all.

#2 If your audience is already pre-sold on your product when they visit your site, landing page changes won't be as meaningful. Possibly relevant to Jitbit, in this case.

#3 When you do conversion optimization, eventually you hit a number that simply becomes unbeatable. Stating the obvious, you can't convert at over 100%. Depending on traffic source/quality/intention you may find that ceiling to be lower, around 60-70%.

In reverse order.

#3. For real companies with completely unmotivated visitors, a 60% conversion rate from visitor to paying customer is untenable. 5% is probably too much to hope for.

#2. Absolutely. A/B tests should be focused on actual balanced decision points. "Should I sign up?" "Should I open this email?" "Should I go back to this site?" But not on people who are already committed to doing what they were going to do anyways.

#1. On sample sizes, it is important to think carefully about the maximum effort you're willing to put into a test. Be very, very cautious about accepting test results that arrive early. No, 99% confidence is not enough to stop with 200 conversions, and 99.9% probably isn't either. Don't worry about statistical significance when you kill tests that have run too long to be worth continuing. This is a line of reasoning that is sadly rare in our industry. (I keep meaning to write my next article on that topic. But http://elem.com/~btilly/ab-testing-multiple-looks/part2-limi... explains one way to come up with such a strategy.)

Look at OP's point about traffic source. It's likely that one of your traffic sources converts way better than 5%, while unqualified people straight from Google will convert at a much lower rate.
#2 is totally relevant to our Macro Recorder app.

Macro Recorder is really popular and most of the visitors are already "pre-sold". They don't care about the landing page.

We used to run tests on that page. Because it has 1k daily visitors, we thought that we could test ideas more quickly. That was before we realised that we can't rely on the results.

We even used false conclusions, based on those results, to make changes to landing pages for other products. Perhaps the biggest mistake we've made.

This takeaway: "Don’t finish your tests before you have a statistically significant result" is potentially highly misleading.

See: http://www.evanmiller.org/how-not-to-run-an-ab-test.html
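To illustrate why stopping the moment a test looks significant is risky, here is a minimal simulation sketch. It runs A/A tests (both arms share the same true conversion rate, so every "win" is a false positive) and compares checking once at the end with checking every 100 visitors and stopping at the first significant result. The 5% rate, the 2,000 visitors per arm, and the function names are illustrative assumptions, not figures from the article.

```python
import random
from statistics import NormalDist

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0  # no conversions (or all conversions) so far: nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(peek_every, total_per_arm, rate=0.05, alpha=0.05,
                        trials=500, seed=1):
    """Fraction of A/A tests declared 'significant' when we look every
    `peek_every` visitors per arm and stop at the first significant look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        for n in range(1, total_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % peek_every == 0 and z_test_p_value(conv_a, n, conv_b, n) < alpha:
                hits += 1
                break
    return hits / trials

single_look = false_positive_rate(peek_every=2000, total_per_arm=2000)
many_looks = false_positive_rate(peek_every=100, total_per_arm=2000)
print(f"one look at the end:           {single_look:.3f}")
print(f"20 looks, stop at first 'win': {many_looks:.3f}")
```

In simulations of this kind the repeated-looks rate tends to come out well above the nominal 5%, which is the point of the linked article: fix the sample size in advance, or use a procedure designed for multiple looks.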

Just because the A/B test doesn't produce the results you expected doesn't mean it produced no results. Research scientists who don't prove their hypothesis don't say they learned nothing from the experiments, they just write a different paper.

Just to pick one example from this post: the whole-site redesign didn't change conversions. Did it change something else? More importantly, you now know that investing effort in a full redesign didn't really have any effect. Next time someone suggests changing the site, you have evidence that says it may not be worth the engineering cost, and maybe it's better to focus somewhere else first.

> No company have ever increased the revenue by changing an insignificant detail like a button color. None. Zero. Get over it.

I'm sensitive to hyperbole, and I know he later refines his statement, but this is not true. Google Ads change what seem to be insignificant details by tiny amounts all the time (luminance on links, vertical space below the search bar, padding around sitelinks) and derive significant revenue changes.

I think one issue is that most sites' A/B tests results are difficult to distinguish from noise while Google has what you might call a significant user base.

I agree with your statement above. However, regarding your example, I often wonder how much of the increase is due to Google overcoming ad blindness and how much is because the ads are truly more engaging.
Good question. That's where longer-term measurement comes in. Overcoming ad blindness is a transitory thing that goes away after a period of time while engagement tends to stick (as measured by CTR, say).
True, most A/B tests fail.

However looking at the listed tests, they mostly seem to be general branding tests. Changing your branding, including fairly major redesigns, reliably fails to make a difference.

But lots of other kinds of changes reliably do make a difference. Big button. Putting signup link in multiple places. The word "Free". Streamlining your form. (No, you don't want to ask for first/last name.) Email headlines. Providing relevant results first.

However, they make small differences. You need to plan on tens of thousands of successful conversions in your test. You can make this easier by running multiple tests at once. If you're a small company, you can define conversion as "to the next step in the conversion funnel" instead of "to the end". (But be warned, companies that test long enough and get large enough tend to find cases where that shortcut burned them.)
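For a rough sense of where "tens of thousands of conversions" comes from, here is a sketch using the standard normal-approximation sample-size formula for comparing two proportions. The 5% baseline rate, the 5% significance level, and the 80% power are assumptions chosen for illustration; the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_arm(base_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed in each arm to detect `relative_lift`
    over `base_rate` with a two-sided two-proportion z-test."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

for lift in (0.20, 0.05, 0.02):
    n = visitors_per_arm(base_rate=0.05, relative_lift=lift)
    print(f"{lift:.0%} relative lift: {n:,} visitors per arm, "
          f"about {round(n * 0.05):,} conversions in the control arm")
```

Because the required sample grows with the inverse square of the difference you want to detect, halving the lift roughly quadruples the traffic needed. Under these assumptions a 5% relative lift already needs on the order of 100,000 visitors per arm.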

However it is very important to set expectations. Expect to find few big wins. Expect most to be modest wins. Keep track. For an established company, if A/B testing is adding 5-20% to your bottom line each year, it is totally worthwhile. (If you have traffic and have not been testing, then odds are that your first year will find more than a 20% boost. But this declines in future years.)

A/B testing is a valuable tool. It isn't magic pixie dust.

"No company have ever increased the revenue by changing an insignificant detail like a button color. None. Zero. Get over it."

This is not entirely true: if you have enough traffic, even the slightest change can make a difference. HOWEVER, for 99% of sites what you say is true. A/B testing should be used last, to squeeze out the last few percent of optimization, once you've done everything else. Not as a 'first step'.

One cool thing that can come of these tests, as an anecdote, is to see what is really working on a page, as opposed to what needs to be optimized.

We ran a multivariate test once for a signup form. The results were all similar, except for one which was much, much worse than the others. We could not understand why, because there were no big differences between the elements, but this one particular combination was having a bad time of it. Turns out, the headline was longer for one of the variations. This wrapped the text and pushed the call to action down below the fold, resulting in a ~30% (negative) difference in conversion compared with the ones that were above the fold.

Those things happened (and still happen) a lot to us. At first we were all excited when there was a significant difference between alternatives, but now I just assume that something is wrong by default.
In my experience, sometimes a focus on A/B testing can extremely limit you.

How? Because sometimes you need a change in strategy, and A/B testing is more about changes in implementation of strategy.

For example, sometimes you need to do 'lead gen' instead of 'add to cart.'

Or, sometimes the answer to the problem is to spend a bunch of money on buying Facebook likes from Facebook, instead of tweaking your landing page design.

Or, better yet, sometimes the best strategy is to keep the landing page exactly the same, because it's already optimized, and focus on improving the product, so your customers will stick around longer.

That being said, always be ready to split test everything. And this means, always keep close track of your numbers.

Why? Because as your strategy shifts, you don't want to go blind. You'll have lots of implementation ideas, and you need to know which ones work best.

Interesting headline on the HN post. Joanna and I at Copy Hackers respectfully disagree. A properly-designed split test will ALWAYS produce a result -- you'll generate negative, positive, or neutral lift. And no matter what type of lift you achieve, you can ALWAYS learn from the test.

We just concluded an 11-site split test where we changed nothing but the home page headline -- written as a value proposition -- and we produced 9 of 11 "winning" tests, with an average conversion increase of 34% on the primary KPI (and the 34% includes the "losing" tests).

Bottom line is (from our many years of testing copy)... you have to change meaningful things to create conversion lift. And "meaningful" means in the eyes of your visitors.

Most A/B testing will fail when you're testing the wrong elements.

I have a challenge for you.

The examples you list are "top" of the funnel tests: buttons for engagement, a homepage redesign.

But when I go deeper into your funnel like your pricing grid page and cart checkout page, I see lots of tweaks that, in my experience, can drive the significant gains that you are looking for.

I'm going to use this page as my starting point, by way of example:

http://www.jitbit.com/hosted-helpdesk/purchase/

When I click order now on the "Startup" plan and get myself to the cart page here was my experience:

a) The load time on click to page load for checkout was >3 seconds for me. You can get some idea of why here http://tools.pingdom.com/fpt/ or here http://www.webpagetest.org/

b) Your design changes drastically between those pages, breaking the user's flow and attention. 37signals products do a nice job of maintaining conformity between pricing grid to checkout as an example.

c) You're treating a SaaS checkout like a product checkout - asking about quantity, list price, etc. What % of your users add multiple services to the cart? Can you keep this in more of a flow?

d) You have no visible security certificate on the page at all, when in fact even the placement of your security certificate on checkout can make a drastic impact in conversion eg http://www.conversionvoodoo.com/blog/2010/07/proper-placemen...

e) Yes, some people are still confused by what a CVV2 / CVC2 code is and displaying a visual explanation can help your conversions.

So my challenge for you is to focus on the RIGHT area of your site to test - your highest-intent traffic - and I think you will find that a lot fewer cycles need be wasted and A/B testing remains one of the highest-ROI activities your org can embark on.

Thanks a lot for taking the time to write this awesome comment!

The thing is, we have very little control over the checkout area, since it is hosted by our payment provider and we cannot run tests there. We are not happy about it, and we will move to Stripe as soon as it is available outside of the US.

Really, thanks a lot for this. I will look into this right now; maybe we can fix some of those.

EDIT: We've removed the quantity field from the SaaS products, never thought of that before, thanks.

I would also highly encourage you to look at Balanced or Stripe for checkout - the integration isn't too bad, the pricing is great and you have full control over the look and feel of your checkout pages.

Also, great tip on the CVV2 info, realized we didn't have it in our system either.

Braintree is available outside the US, and they offer client-side JS so you can control the experience and dodge PCI requirements.
My pleasure and hope that it helps!

If you guys are doing decent scale on sales, it's 100% worth building your own checkout process so you can really drill down on your high-intent traffic, or at least testing Stripe etc. (or whatever other solutions are available in your market).

I'm always resistant to putting up "trust logos" on a website. It's like training users to fall for phishing scams: you can trust me because I say you can trust me. Sure, the user can click the link to the verifying site, but how many do? Of those, how many would be able to spot it if you were redirecting them to a look-alike you set up?

I'd much rather train users to trust their SSL connection based on what their browser says rather than what I do.

I think this post and most posts on the topic are missing a major advantage of A/B testing. Sure, one of its uses is to experiment and try to keep optimizing everything. The other use of A/B testing is to prevent regressions in a statistically rigorous way.

Basically, if you can afford it, every change to the website could be A/B tested, and you'd know that the change doesn't negatively impact the conversion rate. Sometimes it's really useful to know that you can safely change the button from red to a colour that matches your palette better, and "no change" is a great result.
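One way to make "no change is a great result" statistically concrete is a non-inferiority check: instead of asking whether the new version is better, ask whether the data rule out a drop larger than some tolerance. The sketch below is one possible formulation, not something the commenter described; the 0.5 percentage point margin, the counts, and the function name are all hypothetical.

```python
from statistics import NormalDist

def no_regression(conv_old, n_old, conv_new, n_new, margin=0.005, alpha=0.05):
    """Return True when the one-sided lower confidence bound on
    (new rate - old rate) lies above -margin, i.e. the data rule out
    a drop bigger than the tolerated margin (an assumed tolerance)."""
    p_old = conv_old / n_old
    p_new = conv_new / n_new
    se = (p_old * (1 - p_old) / n_old + p_new * (1 - p_new) / n_new) ** 0.5
    lower_bound = (p_new - p_old) - NormalDist().inv_cdf(1 - alpha) * se
    return lower_bound > -margin

# Hypothetical counts: red button (old) vs palette-matched button (new).
large_test = no_regression(conv_old=1000, n_old=20000, conv_new=990, n_new=20000)
small_test = no_regression(conv_old=100, n_old=2000, conv_new=99, n_new=2000)
print("20,000 visitors per arm, safe to ship:", large_test)
print(" 2,000 visitors per arm, safe to ship:", small_test)
```

Both examples have the same observed rates (5.00% vs 4.95%), but only the larger sample is enough to rule out a regression beyond the margin. This is why the comment says "if you can afford it".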

I've been using this thinking recently to great advantage.

My basic strategy has been to make the design more beautiful, and if the conversion rate stays the same or improves, I've succeeded.

This way, I've been able to significantly improve the aesthetic appeal of many of my sites without risking the bottom line.

> No company have ever increased the revenue by changing an insignificant detail like a button color. None. Zero. Get over it.

This contradicts the article about A/B testing by Wired, which used the Obama campaign as an example.

http://www.wired.com/business/2012/04/ff_abtesting

Couple points there:

#1 The article is very vague - the Obama team may have actually made a very common mistake and stopped the tests early when they hit a significant level (they were in a big rush at the time; they had days to get everything working, not months). We don't know anything about sample sizes, etc.

#2 The main detail they talk about changing was changing from a movie to a still picture. This is a lot different than changing a button's color - many people will see a movie and close the page if they are in a quiet environment such as work or a library.

So it could very well contradict the A/B testing article by Wired and still be correct - or it could be wrong. Contradiction itself is not a red flag here in something as new as A/B web testing.

Alright, that was a bold statement on my side. That's true for 99.9% of the companies, not 100%.

Again, most of us don't have enough traffic to run successful button color tests.

Have you tried MAB (multi-armed bandit)? It seems like the best way to do A/B testing on small-traffic pages: you can let it run forever without losing much, because poorly performing variants are shown less and less often.

Set up your test with 3 or 4 colors and the program will automatically favor the best one over time, so it doesn't matter how long you let it run, be it weeks, months or years. When you don't want to test anymore, you pick the one that performed best across the many visitors you've had over that long period.

http://en.wikipedia.org/wiki/Multi-armed_bandit
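As a rough illustration of the idea, here is a small simulation using Thompson sampling, which is one of several bandit strategies (the comment above does not say which one to use). Each visitor is shown the variant whose randomly sampled Beta posterior is highest, so weak variants are shown less often as evidence accumulates. The true conversion rates are made up for the simulation; on a real site they are unknown.

```python
import random

def thompson_sampling(true_rates, visitors, seed=0):
    """Simulate a Bernoulli bandit. Returns one [shows, conversions] pair
    per variant after `visitors` simulated page views."""
    rng = random.Random(seed)
    stats = [[0, 0] for _ in true_rates]
    for _ in range(visitors):
        # Sample a plausible conversion rate per variant from Beta(1 + wins, 1 + losses).
        draws = [rng.betavariate(1 + conv, 1 + shows - conv) for shows, conv in stats]
        arm = draws.index(max(draws))
        stats[arm][0] += 1
        # Simulated visitor converts with the variant's (hypothetical) true rate.
        stats[arm][1] += rng.random() < true_rates[arm]
    return stats

# Hypothetical true conversion rates for four button colours.
rates = [0.04, 0.05, 0.03, 0.10]
result = thompson_sampling(rates, visitors=20000)
for i, (shows, conv) in enumerate(result):
    print(f"variant {i}: shown {shows} times, {conv} conversions")
```

The gap between variants here is deliberately large so the effect is easy to see. With small real-world differences a bandit still needs a lot of traffic to settle, and the early-stopping caveats discussed elsewhere in the thread still apply when you finally pick a winner.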

thank you!

i cannot tell you how excited i get when a client wants to split only to change a color of a button or a subtle change in a particular font. so frustrating!

Not quite sure I fully agree with this. Having played with A/B tests for big-box advertising myself, I've seen changing the message lead to tremendous changes. Same for the website: conversions have changed a lot, not always for the better.

I do fully agree that button color really doesn't matter, which may explain OP's new vs old website conversion rate. Frankly, it's about the same layout, without any different calls to action.

"No company have ever increased the revenue by changing an insignificant detail like a button color. None. Zero. Get over it."

I remember Josh, the founder of Omniture, describing how they first got hired by eBay back in the day. Changing the color of the eBay homepage and button to yellow helped eBay increase sales by 15%. That slight increase on a heavily trafficked website resulted in millions of dollars in increased sales.

If all you test is the layout then, for the most part, you are Right! Right! Right!

But...

Even with little traffic, you can still achieve significant gains by focusing on the copy as opposed to the layout and colours.

I learnt this mostly with AdWords. At the prices they charge now, you just have to find the gains wherever they are.

Same principle applies to copy, especially to microcopy. If you change a couple of words in a headline, it rarely makes a difference.
Depends where the copy is. Changing a couple of words in an email subject line can and often does make a huge difference. Of course the fine line you walk is that some of the most effective words at driving behavior (Free! Now!) are also effective at getting you marked as spam.
I think wording in the subject line is important, because it is the only thing a user sees. There are no other factors that affect whether he will open it or not.
Using A/B testing and then measuring conversion rate skips a lot of useful steps.

A/B testing can be used to increase engagement with the site, which may result in a conversion days or weeks later.

Yes, but those things are impossible to track. If metrics changed weeks later, there is no way of knowing that it was caused by a test you ran earlier.
It's getting easier and easier to track; ad retargeting works this way, for example. But yes, it is still harder to track accurately.
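One common way to make delayed conversions attributable is to assign each user to a variant deterministically, so a conversion that happens weeks later can still be traced back to what that user saw. The sketch below assumes you have a stable user identifier (a login or a long-lived cookie), which is not always the case; the function names, ids, and experiment name are hypothetical.

```python
import hashlib

def assign_variant(user_id, experiment, variants=("A", "B")):
    """Deterministically map a user to a variant, so the same user always
    gets the same version, even on a return visit weeks later."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def conversions_by_variant(converted_user_ids, experiment):
    """Attribute conversions, whenever they happen, to the variant each user saw."""
    counts = {}
    for user_id in converted_user_ids:
        variant = assign_variant(user_id, experiment)
        counts[variant] = counts.get(variant, 0) + 1
    return counts

# Hypothetical ids of users who converted long after first seeing the page.
late_conversions = ["user-17", "user-42", "user-99"]
print(conversions_by_variant(late_conversions, "headline-test"))
```

This only solves the bookkeeping. Whether a late conversion was actually caused by the variant is still a statistical question, which is the concern raised in the parent comment.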
In our bizdev meeting yesterday we had a conversation that was very much along the lines of this post. It mirrors our experience over years of A/B testing completely.
This article is full of so much wrong I can't even finish reading it. Why is this on the first page?

Source: I do conversion optimization for a living.

Source: Your opinion. Perhaps you would like to elaborate and add to the conversation instead.
I'm sure everyone here would love to hear what's wrong with it. No, really, if you do that for a living, you most certainly have something to add to the discussion.
Yeah..

If you don't have anything to give to the discussion, maybe you should not comment.