https://arxiv.org/pdf/1912.06719v1
And, arguably, Facebook's unsupervised pre-training for its multi-modal speech-to-text models is much the same idea as unsupervised pre-training for a multi-modal text-to-image diffuser.
https://ai.meta.com/research/publications/wav2vec-2.0-a-fram...