| HN Mirror

I used LLaVA. Unfortunately, I signed a NDA :( so I cannot share the code and the data is private. We fine-tuned it with example images, labels, and text prompts. We also tried in-context learning. Indeed, the prompt was static but we could do data augmentation and provide a series of equivalent prompts. We just used the prompt that gave us the best performance during initial model testing with in-context learning. I am unsure if the existence of equivalent prompts creates instability because a sentence with the same meaning should be quite close in the latent space of the foundation model so it understands them in a similar manner.