There's no 'rigorous comparison' that puts CNNs ahead of ViTs in quality, and ViTs unlocked more use cases more easily than CNNs did. That's why they're more popular, not because it's 'bandwagon-y'.
What use case do they enable that you can't get by running a ConvNeXt or EfficientNetV2 and using the resulting strided features the way you would use the output tokens of a ViT? I'm not saying that ViTs are worse, just that the scholarship comparing them is very bad or nonexistent. You have to tune the hyperparameters properly and fairly on both sides, and apply all the general modern training tricks on the CNN side as well.
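
To make "strided features as tokens" concrete, here is a minimal pure-Python sketch. It assumes the backbone outputs a (C, H, W) feature map as nested lists; `feature_map_to_tokens` and the toy values are made up for illustration and don't come from any library.

```python
def feature_map_to_tokens(feature_map):
    """Flatten a (C, H, W) feature map into an (H*W, C) token sequence.

    Each spatial position becomes one token whose embedding is the
    channel vector at that position, in row-major order.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


# Toy stand-in for a CNN backbone output: 3 channels on a 2x2 grid.
# (Hypothetical values; a real backbone would produce floats.)
fmap = [
    [[c * 10 + y * 2 + x for x in range(2)] for y in range(2)]
    for c in range(3)
]

tokens = feature_map_to_tokens(fmap)
print(len(tokens), len(tokens[0]))  # number of tokens, embedding size
```

The resulting sequence has the same layout as ViT patch tokens (one vector per spatial location), so in principle whatever consumes ViT tokens can consume this.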