You could offset the "tap the first picture" effect by e.g. randomizing the order the pictures are shown, weighting the votes of serial "top clickers" less, etc.
Then you're just applying random votes, instead of actual votes.
For example, something that should be 15-2 votes, ends up being 95-82. You'll be applying a large number of votes to both sides, and pushing everything towards a 50/50 rating. This doesn't help anyone, the goal is A/B testing, and you're making it more difficult to get accurate data. 15-2 shows a lot of promise for A, but then you add 80 random votes to both sides, and 95-82 seems like a tie.
Well, I didn't really outline a specific approach. I was just suggesting that there are techniques you could apply to the data after collection to solve the problem without modifying the fundamental premise of the app. For example, if you "weighted the votes of the serial top clickers less" in addition to randomizing the order of the pictures shown, so that each serial top clicker's vote counts as 0.1 instead of 1, then your 95-82 moves back toward the 15-2 and leaves you with something like 25-12, which isn't as clear as 15-2 but is still clearly significant.
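To make the arithmetic concrete, here is a rough sketch of that down-weighting. It assumes you have already flagged the serial top clickers somehow (the detection step is not shown, and the voter IDs, `weighted_tally`, and the 0.1 weight are just illustrative choices, not a recommendation):

```python
def weighted_tally(votes, weights, default_weight=1.0):
    """Sum votes per option, scaling each vote by its voter's weight."""
    totals = {}
    for voter, option in votes:
        weight = weights.get(voter, default_weight)
        totals[option] = totals.get(option, 0.0) + weight
    return totals


# Hypothetical data: 15-2 from attentive voters, plus 80 votes per side
# from serial top clickers (order was randomized, so they split evenly).
votes = (
    [(f"real{i}", "A") for i in range(15)]
    + [(f"real{i}", "B") for i in range(15, 17)]
    + [(f"clicker{i}", "A") for i in range(80)]
    + [(f"clicker{i}", "B") for i in range(80, 160)]
)

# Assumption: every flagged clicker counts as 0.1 of a vote.
clicker_weights = {f"clicker{i}": 0.1 for i in range(160)}

raw = weighted_tally(votes, {})
adjusted = weighted_tally(votes, clicker_weights)

print({k: round(v, 1) for k, v in raw.items()})
print({k: round(v, 1) for k, v in adjusted.items()})
```

With these made-up numbers the raw tally is 95-82 and the adjusted one comes out around 23-10, in the same ballpark as the 25-12 I guessed at.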
Instead of showing the user 95-82 or 25-12, you tell the user "others preferred shirt A 2-to-1" or "shirt B 66% to 33%", or whatever might be appropriate.
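Something along these lines could produce that kind of message from a tally. The function name and the wording of the output are made up for illustration; it is only a sketch of the idea:

```python
def preference_summary(a_votes, b_votes):
    """Describe a two-way result as percentages instead of raw counts."""
    total = a_votes + b_votes
    if total == 0:
        return "No votes yet"
    if a_votes >= b_votes:
        winner, share = "A", a_votes / total
    else:
        winner, share = "B", b_votes / total
    return f"Others preferred shirt {winner}: {share:.0%} to {1 - share:.0%}"


print(preference_summary(23, 10))  # Others preferred shirt A: 70% to 30%
print(preference_summary(95, 82))  # Others preferred shirt A: 54% to 46%
```

The same function works on raw or weighted counts, so the user never has to see the inflated totals.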
Anyway, again, I'm not suggesting any particular techniques are the right ones, just that there are (almost certainly) viable techniques available to mitigate the problem.
For example, something that should be 15-2 votes, ends up being 95-82.
It's all in the presentation. If you just highlight one image and stamp "WINNER" next to it, most people won't even look at the numbers. Crowning a winner is more important than being scientifically accurate.
For example, something that should be 15-2 votes, ends up being 95-82. You'll be applying a large number of votes to both sides, and pushing everything towards a 50/50 rating. This doesn't help anyone, the goal is A/B testing, and you're making it more difficult to get accurate data. 15-2 shows a lot of promise for A, but then you add 80 random votes to both sides, and 95-82 seems like a tie.