For a multimodal example, see: https://www.marqo.ai/blog/generalized-contrastive-learning-f...
But the same approach works with text.