This may be a dumb question, but would it be possible to apply these techniques to something like text completion and/or visual question answering? That is, what if you used the optimizations but still scaled the model up?
Yes, we are training text embedding models right now, and we also plan to open-source some of them!
In addition, we train encoders for other modalities, such as video, for retrieval purposes.