One intuition is that you can generate pairs you know to be the “same thing” (a single example under heavy augmentation) and train so that they end up close in representation space, while mismatched pairs are pushed far apart.
That’s a label-free approach that should give you a space with nice properties for, e.g., nearest-neighbor methods, and it follows that there’s some reason to believe it’d be a generally useful feature space for downstream problems.
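To make the "pull matched pairs together, push mismatched pairs apart" idea concrete, here is a minimal NumPy sketch of an InfoNCE-style contrastive loss. It is only an illustration under my own assumptions: the function name `info_nce_loss`, the `temperature` default, and the convention that row i of each array holds one augmented view of example i are all hypothetical, not taken from the article or any particular library.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Contrastive loss over a batch of paired embeddings.

    z_a[i] and z_b[i] are assumed to be embeddings of two augmented
    views of the same example; every other row acts as a negative.
    """
    # L2-normalize so the dot product below is cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # (N, N) similarity matrix; entry (i, j) compares view a of i with view b of j.
    logits = z_a @ z_b.T / temperature

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Matched pairs sit on the diagonal; maximize their log-probability.
    return -np.mean(np.diag(log_probs))
```

The loss is low when each embedding is most similar to its own partner and high when the pairing is scrambled, which is exactly the property that makes the resulting space friendly to nearest-neighbor lookups.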
Note that most sample pairing, especially for images, is currently done through augmentations, so the implicit labeling you're doing is still weak on priors.
Of the methods mentioned in the article, BYOL and, even more so, its follow-up SimSiam [1] make the weakest assumptions and work surprisingly well despite their simplicity.
I agree with OP that this is still essentially learning on labeled data.
I say this because there are also cases of contrastive-sampling-like ideas applied to truly unsupervised data.
One example is graph embedding, where the graph itself implies structural notions of similarity and distance that the representations should capture.
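As a rough sketch of how a graph can supply the pairs by itself, the snippet below samples positive pairs from short random walks, in the spirit of DeepWalk-style methods. That choice is my assumption, not something the article specifies, and the function name `random_walk_pairs`, its parameters, and the adjacency-dict format are hypothetical.

```python
import random

def random_walk_pairs(adjacency, walk_length=5, window=2, walks_per_node=10, seed=0):
    """Sample (node, context) positive pairs from short random walks.

    adjacency maps each node to a list of its neighbors. Nodes that
    co-occur within `window` steps of a walk are treated as similar.
    """
    rng = random.Random(seed)
    pairs = []
    for start in adjacency:
        for _ in range(walks_per_node):
            # Walk from `start`, choosing a uniformly random neighbor each step.
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            # Pair each node with the nodes that follow it within the window.
            for i, u in enumerate(walk):
                for v in walk[i + 1 : i + 1 + window]:
                    if u != v:
                        pairs.append((u, v))
    return pairs
```

No augmentation or label is involved here: the positives come purely from the graph structure, and negatives could be drawn as random nodes that did not co-occur in a walk.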