Models like Salesforce BLIP can be used to generate captions for images too - I built a little CLI tool for that here: https://github.com/simonw/blip-caption
CogVLM blows LLaVA out of the water, although it needs a beefier machine (the quantized low-res version barely fits into 12GB of VRAM, and I'm not sure how accurate that version is).
I have no actual knowledge in this area, so I'm not sure if it's entirely relevant, but an update from the 7th of December on the CogVLM repo says it now works with 11GB of VRAM.
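For a rough sense of why the 11-12GB figures are plausible, here is a back-of-the-envelope sketch of weight memory at different precisions. The parameter count of about 17 billion is an assumption taken from the "CogVLM-17B" model name, and `weight_memory_gb` is a hypothetical helper, not anything from the CogVLM repo. It counts weights only; activations, the KV cache, and framework overhead come on top, which would explain why a 4-bit version still "barely fits".

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 17e9 parameters, based on the "CogVLM-17B" name.
ASSUMED_PARAMS = 17e9

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(ASSUMED_PARAMS, bits):.1f} GB")
```

Under that assumption, 4-bit weights come to about 8.5 GB, leaving only a few GB of headroom on a 12GB card for everything else.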