by johnthewise 972 days ago
Which foundation model did you fine-tune with so few images? Just curious. I personally believe a language interface or language conditioning is not very relevant, and may even be harmful, for many downstream CV applications. In your case, you don't need to ask whether the metal is bent through a language interface: there are a hundred ways to phrase that question, and the output would be slightly different for each one. That's unwanted instability. I feel that conditioning on inputs based on a few examples would be much more relevant, i.e. instead of conditioning on text embeddings, why not condition on the embeddings of these 3 images and their labels?
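
A minimal sketch of what conditioning on a few labeled example embeddings could look like, assuming a nearest-prototype classifier with cosine similarity. This is one illustrative reading of the suggestion, not a method anyone in the thread confirmed using. The three-dimensional vectors are toy stand-ins for real image-encoder outputs, and `build_prototypes` and `classify` are hypothetical helper names.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prototypes(support):
    """Average the support embeddings per label (hypothetical helper)."""
    by_label = {}
    for embedding, label in support:
        by_label.setdefault(label, []).append(embedding)
    return {label: mean_vector(embs) for label, embs in by_label.items()}


def classify(query, prototypes):
    """Return the label whose prototype is most similar to the query."""
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Toy embeddings standing in for the output of an image encoder (assumption).
support = [
    ([0.9, 0.1, 0.0], "bent"),
    ([0.8, 0.2, 0.1], "bent"),
    ([0.1, 0.9, 0.2], "straight"),
]
prototypes = build_prototypes(support)
print(classify([0.85, 0.15, 0.05], prototypes))
```

No text prompt appears anywhere in this setup, so there is no phrasing to vary. The trade-off is that the result depends entirely on how well the encoder separates the classes.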

I used LLaVA. Unfortunately, I signed an NDA :( so I cannot share the code, and the data is private. We fine-tuned it with example images, labels, and text prompts. We also tried in-context learning. Indeed, the prompt was static, but we could do data augmentation and provide a series of equivalent prompts. We just used the prompt that gave us the best performance during initial model testing with in-context learning. I am not sure that the existence of equivalent prompts creates instability: sentences with the same meaning should be quite close in the latent space of the foundation model, so it should interpret them in a similar manner.
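
Whether equivalent prompts actually cause instability can be checked empirically. Below is a rough sketch, assuming you measure how often a model returns the same answer for one image across paraphrased prompts. `toy_predict` is a hypothetical stand-in for a real model call (it is not LLaVA's API), and the file name and prompts are made up for illustration.

```python
from collections import Counter


def prompt_agreement(predict, image, prompts):
    """Return the majority answer and the fraction of prompts that produced it."""
    answers = [predict(image, prompt) for prompt in prompts]
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / len(answers)


def toy_predict(image, prompt):
    """Hypothetical model stub, deliberately sensitive to one wording."""
    return "no" if "deformed" in prompt else "yes"


# Paraphrases of the same question (illustrative, not from the thread).
prompts = [
    "Is the metal bent?",
    "Is this part bent?",
    "Is the metal deformed?",
    "Does the metal look bent?",
]
majority, rate = prompt_agreement(toy_predict, "part_001.png", prompts)
print(majority, rate)
```

An agreement rate close to 1.0 across a validation set would support the view that paraphrases land near each other in the latent space. A noticeably lower rate would point to the instability raised in the parent comment.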