by johnthewise
972 days ago
Which foundation model did you fine-tune with a few images? Just curious. I personally believe a language interface, or language conditioning in general, is not very relevant and may even be harmful for many downstream CV applications. In your case, you don't need to ask whether the metal is bent through a language interface: there could be a hundred ways to phrase that question, and the outputs would be slightly different for each one. That's unwanted instability. I feel that conditioning on inputs built from a few examples would be much more relevant, i.e. instead of conditioning on text embeddings, why not condition on the embeddings of these 3 images and their labels?
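To make the idea concrete, here is a rough sketch of what I mean by conditioning on a few labeled image embeddings rather than on a text prompt. It is only an illustration: the 3-dimensional vectors are made-up stand-ins for whatever a frozen image encoder would produce, and `cosine` and `classify` are hypothetical helper names, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(query, support):
    """Return (label, similarity) of the support example closest to query."""
    best_label, best_sim = None, -2.0  # cosine similarity is always >= -1
    for embedding, label in support:
        sim = cosine(query, embedding)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim

# Toy "support set": 3 example images and their labels.
# Assumption: in practice these vectors come from a frozen image encoder.
support = [
    ([0.9, 0.1, 0.0], "bent"),
    ([0.8, 0.2, 0.1], "bent"),
    ([0.1, 0.9, 0.2], "straight"),
]

query = [0.85, 0.15, 0.05]  # embedding of a new image (also made up)
label, sim = classify(query, support)
print(label)
```

There is no prompt wording anywhere in this loop, so the prediction depends only on the example images and their labels, not on how a question happens to be phrased.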