1. Did you look at CLIP? It provides a common embedding space for images and text.
2. Do your models need specialized training, or could you use open models instead?