Some really interesting work lately on "contrastive" learning, where accuracy is getting close to that of supervised learning, e.g. https://arxiv.org/abs/2002.05709
.. CPC is .. translating a generative modeling problem to a classification problem... uses cross-entropy loss to measure how well the model can classify the “future” representation amongst a set of unrelated “negative” samples...
One variant of the task: given a crop of an image (or a short audio snippet, etc.), can you find the matching crop from the same image in a set that also contains a lot of negative samples (crops of other, unrelated images)?
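As a rough illustration of that retrieval task written as a classification loss, here is a minimal sketch in plain Python. It is not the code from any of the papers: the function names, the use of cosine similarity, the temperature value, and the toy 2-D embeddings are all assumptions made for the example. The anchor embedding is scored against one positive and several negatives, and the loss is the cross-entropy of picking the positive.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy of picking `positive` out of [positive] + negatives."""
    # Similarity of the anchor to every candidate; index 0 is the true match.
    logits = [cosine(anchor, c) / temperature for c in [positive] + negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability assigned to the true match.
    return log_sum_exp - logits[0]

# Toy embeddings standing in for encoder outputs (made-up numbers).
anchor = [1.0, 0.0]
good_positive = [0.9, 0.1]   # close to the anchor: an encoder doing its job
bad_positive = [0.0, 1.0]    # indistinguishable from a negative
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss_good = contrastive_loss(anchor, good_positive, negatives)
loss_bad = contrastive_loss(anchor, bad_positive, negatives)
```

The loss is small when the matching crop's embedding lies closer to the anchor than the negatives do, and it grows when the positive cannot be told apart from them. That is the pressure that pushes the encoder toward keeping the shared content of the two crops.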
To succeed, the encoder needs to extract the underlying, useful information contained in the patch (called slow features) and discard the noise, since that makes the retrieval much easier.
This yields an encoder that gives pretty good representations of your inputs, and you can then fine-tune some additional layers on top of it for your final task.
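The "additional layers on top" step can be sketched in its simplest form: a single logistic-regression head trained on features from a frozen encoder. Everything here is hypothetical and only for illustration: `frozen_encoder` is a hand-written stand-in for a pretrained network, and the four labelled points are made up.

```python
import math

def frozen_encoder(x):
    """Hypothetical stand-in for a pretrained encoder; it is never updated."""
    return [x[0] + x[1], x[0] - x[1]]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on fixed features with plain SGD."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            err = p - y  # gradient of the cross-entropy w.r.t. the logit
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b

def predict(w, b, f):
    """Return class 1 if the head's logit is positive, else class 0."""
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0

# Tiny made-up labelled set for the downstream task.
inputs = [[2.0, 1.0], [1.5, 0.5], [-1.0, -2.0], [-0.5, -1.5]]
labels = [1, 1, 0, 0]

# Encode once; only the head's parameters (w, b) are trained.
features = [frozen_encoder(x) for x in inputs]
w, b = train_head(features, labels)
predictions = [predict(w, b, f) for f in features]
```

In a real setup the head would usually be a small network trained with a deep learning framework, and the encoder might also be unfrozen and updated with a small learning rate. The sketch only shows the division of labour between the pretrained encoder and the task-specific head.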