Bonus points: if you A/B test, you'll also learn to make improvements which go against the advice of the experts... and the "experts."
The first example which jumps to mind: "Sitewide navigation should be consistent" is a common UX bromide and is provably suboptimal with regard to conversions of interest, including signup to SaaS trials and success with checkout at e-commerce sites. (n.b. those are my results, and I've repeated each multiple times with consulting clients, but they may not be your results, which is why you want to A/B test.)
There are very few universal UX principles -- context almost always matters. The idea of universal UX principles is a holdover from the 80s when the field was still dominated by cognitive psychology approaches to HCI.
Having said that, UX is intended to help users achieve their goals. A/B testing is generally used to achieve business goals. Those two things are not necessarily always aligned.
Even better - have A/B testing and experts. They're good at understanding why some A/B tests come out certain ways, can point out issues that might need further investigation, and can help discover new tweaks and fixes that you can, in turn, A/B test.
(I know Patrick isn't saying this - but I hit the 'we don't need UX / we don't need usability tests - we A/B test!' thing quite a bit now and it's bloody annoying. Both because (a) yes you do, and (b) the folk saying it are usually doing a lousy job of A/B testing ;-)
Their diagnosis of why normal form elements fared better than the faux (custom) elements is lacking:
(1) There is no evidence to suggest the reason people didn't like the faux elements was that they weren't the defaults. They could have just been plain inferior. For example, the second faux element approximates a multi-select element, which opens when clicked like a familiar select dropdown, but which also allows users to select multiple choices. Whether this was done with a faux element or a native one, I can't imagine it ever performing better than checkboxes.
(2) One of the biggest differences between the faux elements and the default elements was the fact that, with the default elements, they included black labels above the elements. I know the point of the faux elements was to save space, but I can't help but wonder how they would have fared had they included labels as well.
(1) When I refer to standard form elements, I am actually referring to two separate things: the standard interaction for this particular type of option and the native element for supporting that interaction. I believe your argument is that we could have made custom elements that supported this standard action (e.g. checkboxes) that may have performed just as well or better than the native elements. While this is certainly possible, the point was intended to be more about the standard interaction and less about the native elements. Of course, the native elements are nice because they support these standard interactions and they are easy to maintain.
Although it wasn't shown in the blog post, the faux selects used (custom) checkbox buttons for selecting multiple options once opened.
(2) I originally intended to mention this, but ended up cutting that paragraph. As you point out, the faux elements were designed to save space, which is why they didn't have labels. Part of the standard form interaction is having the labels above elements, so that's what we did. However, in retrospect, I wish we had tested the label and element changes separately. I suspect each of them would have shown an increase in conversion.
One of the dangers of A/B testing that the author didn't discuss is actually measuring statistical validity. With a small sample size or a small difference between versions, the result of a split test could just be random chance. I find it is always important to figure out the statistical significance of your results to ensure it's not just the roll of the dice.
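To make the "roll of the dice" point concrete, here is a minimal sketch of one common way to check significance for a conversion test: a two-sided, pooled two-proportion z-test. The function name and the example numbers are hypothetical, and this is not necessarily the method any particular A/B testing tool uses.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference in conversion rates.

    conv_a / conv_b are conversion counts, n_a / n_b are visitor counts.
    Hypothetical helper for illustration; relies on the normal
    approximation, so it is only reasonable for fairly large samples.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both variants need at least one visitor")
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Same observed rates (10% vs 15%), very different sample sizes.
z_small, p_small = two_proportion_z_test(10, 100, 15, 100)
z_large, p_large = two_proportion_z_test(100, 1000, 150, 1000)
print(f"100 visitors per arm:  z={z_small:.2f}, p={p_small:.3f}")
print(f"1000 visitors per arm: z={z_large:.2f}, p={p_large:.4f}")
```

With 100 visitors per variant, the 10% vs 15% gap does not clear the usual 0.05 threshold, while the same gap with 1000 visitors per variant does. Checking results repeatedly and stopping as soon as they look significant inflates the false positive rate, so it is safer to fix the sample size up front.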
This is a good point, and something we definitely consider. Our A/B testing tool, ABBA, performs the necessary statistics and is mentioned in the "Test everything" section of the post.
One of the dangers of A/B testing that the author didn't discuss is actually measuring statistical validity
Isn't that actually one of the problems of thinking you're doing A/B testing when you're actually not ;-) A proper analysis is an intrinsic part of what A/B testing is.
It's like doing TDD without the refactoring step. TDD without refactoring is, well, not TDD.
Because if you trust your gut, you will often be wrong. There's a reason why the most successful companies regularly A/B test.
We have seen some very significant (and surprising!) results at our startup from A/B testing in terms of registration rate, customer lifetime value and other key metrics.